(54) Executing speculative parallel instruction threads



(57) A central processing unit (CPU) in a computer that permits speculative parallel execution of more
than one instruction thread. The CPU uses FORK-SUSPEND instructions that are added to the instruction set
of the CPU, and are inserted in a program prior to run-time to delineate potential future threads for
parallel execution. The CPU has an instruction cache with one or more instruction cache ports, a bank of
one or more program counters, a bank of one or more dispatchers, a thread management unit that handles
inter-thread communications and discards future threads that violate dependencies, a set of architectural
registers common to all threads, and a scheduler that schedules parallel execution of the instructions on
one or more functional units in the CPU.

This invention relates to the field of executing parallel threads of instructions on a computer system (THREAD: A
sequence of instructions executable using a single instruction sequencing control (implying a single program counter)
and a shared set of architecturally visible machine state). More specifically, the invention relates to determining which
threads of instructions in a computer program can be executed in parallel and accomplishing this parallel execution in
a speculative manner.

Commercially available microprocessors currently have a uniprocessor architecture. This architecture may include
one or more functional units (branch unit, load/store unit, integer arithmetic unit, floating-point arithmetic unit, etc.)
that share a common set of architecturally visible registers. (A register is considered architecturally visible if it is
accessible to the assembly level programmer of the processor or to the compiler of the processor that translates a
higher level program to the assembly level of the machine.)

In computer systems, instructions generated by a compiler or an assembly programmer are placed in a sequence
in an instruction memory, prior to run time, from where they can be fetched for execution. This sequence is called the
static order. A dynamic order is the order in which the computer executes these instructions. The dynamic order may or
may not be the static order. (In the discussion to follow, the phrase compile time is used to refer to the timing of any
prior-to-run-time processing. Note, however, that although such processing is very likely to be carried out by a
compiler, other means, such as assembly level programming, could be employed instead.)

Prior art scalar computers, i.e., non-superscalar computers, or machines that execute instructions one at a time,
have a unique dynamic order of execution that is called the sequential trace order (SEQUENTIAL TRACE ORDER: The
dynamic order of execution sequence of program instructions, resulting from the complete execution of the program on
a single-control-thread, non-speculative machine that executes instructions one-at-a-time). Let an instruction A precede
another instruction B in the sequential trace order. Such an instruction A is also referred to as an earlier instruction with
respect to B. These computers execute instructions in their static order until a control instruction is encountered. At this
point instructions may be fetched from a (non-consecutive) location that is out of the original sequential order. Then
instructions are again executed in the static sequential order until the next control instruction is encountered. Control
instructions are those instructions that have the potential of altering the sequential instruction fetch by forcing the future
instruction fetches to start at a non-consecutive location. Control instructions include instructions like branch, jump, etc.

Some prior art machines can execute instructions out of their sequential trace order if no program dependencies
are violated. These machines fetch instructions sequentially in the sequential trace order or fetch groups of instructions
simultaneously in the sequential trace order. However, these machines do not fetch these instructions out of their
sequential trace order. For example, if instruction A precedes instruction B in the sequential trace order, prior art
machines can sequentially fetch instruction A then B, or simultaneously fetch instruction A with B, but do not fetch
instruction B before A. Such a restriction is characteristic of machines with a single program counter. Therefore,
machines with such constraints are said to be single thread or uni-thread machines. They are unable to fetch
instructions later in the sequential trace order before fetching prior instructions in the sequential trace order.

All of the current generation of commercial microprocessors known to the inventors have a single thread of control 
flow. 

Such processors are limited in their ability to exploit control and data independence of various portions of a given
program. Some of the important limitations are listed below:

o Single thread implies that the machine is limited to fetching a single sequence of instructions and is unable to pur- 
sue multiple flows (threads) of program control simultaneously. 

o Single-thread control further implies that data independence can only be exploited if the data-independent
instructions are close enough (e.g., in a simultaneous fetch of multiple instructions into the instruction buffer) in the
thread to be fetched close together in time and examined together to detect data independence.

o The limitation above in turn implies reliance on the compiler to group together control-independent and
data-independent instructions.

o Some prior art microprocessors contain some form of control instruction (branch) prediction, called control-flow
speculation. Here an instruction following a control instruction in the sequential trace order may be fetched and
executed in the hope that the control instruction outcome has been correctly guessed. Speculation on control flow is
already acknowledged as a necessary technique for exploiting higher levels of parallelism. However, due to the lack
of any knowledge of control dependence, single-thread dynamic speculation can only extend the ability to look
ahead until there is a control flow mis-speculation (bad guess). A bad guess can cause a waste of many execution
cycles. It should be noted that run-time learning of control dependence via single thread control flow speculation is
at best limited in scope, even if the hardware cost of control-dependence analysis is ignored. Scope here refers to
the number of instructions that can be simultaneously examined for the inter-instruction control and data
dependencies. Typically, one can afford a much larger scope at compile time than at run time.

o Compile-time speculation on control flow, which can have much larger scope than run-time speculation, can also
benefit from control-dependence analysis. However, the run-time limitation of a single thread again requires the
compiler to group together these speculative instructions along with the non-speculative ones, so that the
parallelism is exploitable at run time.

The use of compile-time control flow speculation to expose more parallelism at run time has been mentioned
above. Compilers of current machines are limited in their ability to encode this speculation. Commonly used
approaches, such as guarding and boosting, rely on the compiler to percolate some instructions to be speculatively
executed early in the single thread execution. They also require that the control flow speculation be encoded in the
speculative instruction. This approach has the following important limitations:

o It is typically very difficult to find enough unused bits in every instruction to encode even shallow control flow
speculations. Note that due to backward compatibility constraints (ability to run old binaries, without any translation),
instruction encoding cannot be arbitrarily rearranged (implying new architecture) to include the encoding of control
flow speculation.

o The percolation techniques mentioned above often require extra code and/or code copying to handle
mis-speculation. This results in code expansion.

o Sequential handling of exceptions raised by the speculative instructions and precise handling of interrupts are often
architecturally required. However, implementing these in the context of such out-of-order speculative execution is
often quite difficult due to the upward speculative code motion used by the percolation techniques mentioned
above. Special mechanisms are needed to distinguish the percolated instructions and to track their original
location. Note that from the point of view of external interrupts, under the constraints of precise handling of interrupts,
any instruction execution out of the sequential trace order may be viewed as speculative. However, in a restricted
but more widely used sense, an execution is considered speculative if an instruction processing is begun before
establishing that the instruction (more precisely, the specific dynamic instance of the instruction) is part of the
sequential trace order, or if operands of an instruction are provided before establishing the validity of the operands.

Ignorance of control dependence can be especially costly to performance in nested loops. For example, consider
a nested loop, where outer iterations are control and data independent of data dependent inner loop iterations. If
knowledge of control and data independence of outer loop iterations is not exploited, their fetch and execution must be
delayed, due to the serial control flow speculation involving the inner loops. Furthermore, due to this lack of knowledge
of control dependence, speculatively executed instructions from an outer loop may unnecessarily be discarded on the
misprediction of one of the control and data independent inner loop iterations. Also, note that the probability of
misprediction on the inner loop control flow speculation can be quite high in cases where the inner loop control flow is
data dependent and hence quite unpredictable. One such example is given below.

    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                cdr(car(ep)) = new_p;

This is a doubly nested loop, where the inner loop traverses a linked list and its iterations are both control and data
dependent on previous iterations. However, each activation of the inner loop (i.e., each outer loop iteration) is
independent of the previous one. [This is a slightly modified version of one of the most frequently executed loops
(Xlgetvalue) in one of the SPECint92 benchmarks (Li).]
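For readers who want to compile the loop above, a minimal self-contained sketch follows. The `node` cell type, the `car`/`cdr` macros, and the `cons`, `set_value` and `xl_demo` helpers are assumptions introduced for illustration only; the specification shows just the loop body, and the layout assumed here (a list of frames, each a list of symbol/value pairs) is a guess at the benchmark's data structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cons cell; not taken from the specification. */
typedef struct node {
    struct node *car;
    struct node *cdr;
} node;

#define car(n) ((n)->car)
#define cdr(n) ((n)->cdr)

/* Allocate one cell. The sketch never frees memory. */
node *cons(node *a, node *d) {
    node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->car = a;
    n->cdr = d;
    return n;
}

/* Outer loop: one iteration per frame, independent of the other frames.
   Inner loop: pointer chasing through the bindings of one frame, so each
   iteration depends on the previous one. Returns the number of updates. */
int set_value(node *xlenv, node *sym, node *new_p) {
    int updates = 0;
    for (node *fp = xlenv; fp; fp = cdr(fp))
        for (node *ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep))) {
                cdr(car(ep)) = new_p;
                updates++;
            }
    return updates;
}

/* Build a two-frame environment in which only the second frame binds sym,
   run the loop, and report whether exactly that binding was updated. */
int xl_demo(void) {
    node *sym = cons(NULL, NULL), *other = cons(NULL, NULL);
    node *old_p = cons(NULL, NULL), *new_p = cons(NULL, NULL);
    node *frame1 = cons(cons(other, old_p), NULL);
    node *binding = cons(sym, old_p);
    node *frame2 = cons(binding, NULL);
    node *env = cons(frame1, cons(frame2, NULL));
    return set_value(env, sym, new_p) == 1 && cdr(binding) == new_p;
}
```

Each pass of the outer `for` touches only the frame it was handed, which is why the outer iterations are candidates for separate threads, while the inner `for` cannot advance without the previous `cdr`.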

As explained above, machines with single control flow have to rely on the compiler to group together speculative
and/or non-speculative data-independent instructions. However, to group together all data and control independent
instructions efficiently, the compiler needs enough architected registers for proper encoding. Therefore, register
pressure is increased and beyond a point such code motion becomes fruitless due to the overhead of additional spill code.

Some research attempts have been made to build processors with multiple threads, primarily aimed at
implementing massively parallel architectures. The overhead of managing multiple threads can potentially outweigh the
performance gains of additional concurrency of execution. Some of the overheads associated with thread management
are the following:

o Maintaining and communicating the partial order due to data and control dependence, through explicit or implicit 
synchronization primitives. 

o Communicating the values created by one thread for use by another thread.

o Trade-offs associated with static, i.e., compile-time, thread scheduling versus dynamic, i.e., run-time, thread
scheduling. Static thread scheduling simplifies run-time hardware, but is less flexible and exposes the thread resources
of a machine implementation to the compiler, and hence requires recompilation for different implementations. On
the other hand, dynamic thread scheduling is adaptable to different implementations, all sharing the same
executable, but it requires additional run-time hardware support.

An object of this invention is an improved method and apparatus for simultaneously fetching and executing different
instruction threads.

An object of this invention is an improved method and apparatus for simultaneously fetching and executing different
instruction threads with one or more control and data dependencies.

An object of this invention is an improved method and apparatus for simultaneously fetching and speculatively exe- 
cuting different instruction threads with one or more control and data dependencies. 

An object of this invention is an improved method and apparatus for simultaneously fetching and speculatively
executing different instruction threads with one or more control and data dependencies on different implementations
of the computer architecture.

The present invention is an enhancement to a central processing unit (CPU) in a computer that permits speculative
parallel execution of more than one instruction thread. The invention discloses novel FORK-SUSPEND instructions that
are added to the instruction set of the CPU, and are inserted in a program prior to run-time to delineate potential future
threads (MAIN VS. FUTURE THREADS: Among the set of threads at any given time, the thread executing the
instruction earliest in the sequential trace order is referred to as the main thread. The remaining threads are referred to
as future threads) for parallel execution. Preferably, this is done by a compiler.
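The main/future distinction defined above amounts to a minimum over trace positions. The sketch below is only an illustration of that definition: the `thread_state` record, its `trace_pos` field and the function name are assumptions, and the specification does not say that hardware tracks absolute trace positions this way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread record. trace_pos is the position, in the
   sequential trace order, of the instruction the thread is executing. */
typedef struct {
    int active;               /* nonzero if the thread currently exists */
    unsigned long trace_pos;
} thread_state;

/* Return the index of the main thread, i.e. the active thread executing
   the earliest instruction in the sequential trace order, or -1 if no
   thread is active. Every other active thread is a future thread. */
int main_thread_index(const thread_state *threads, size_t n) {
    assert(threads != NULL || n == 0);
    int main_idx = -1;
    for (size_t i = 0; i < n; i++) {
        if (!threads[i].active)
            continue;
        if (main_idx < 0 || threads[i].trace_pos < threads[main_idx].trace_pos)
            main_idx = (int)i;
    }
    return main_idx;
}
```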

The CPU has an instruction cache with one or more instruction cache ports and a bank of one or more program
counters that can independently address the instructions in the instruction cache. When a program counter addresses
an instruction, the addressed instruction is ported to an instruction cache port. The CPU also has one or more
dispatchers. A dispatcher receives the instructions ported to an instruction cache port in an instruction buffer associated
with the dispatcher. The dispatcher also analyzes the dependencies among the instructions in its buffer. A thread
management unit in the CPU handles any inter-thread communication and discards any future threads that violate
program dependencies. A CPU scheduler receives instructions from all the dispatchers in the CPU and schedules parallel
execution of the instructions on one or more functional units in the CPU. Typically, one program counter will track the
execution of the instructions in the main program thread and the remaining program counters will track the parallel
execution of the future threads. The porting of instructions and their execution on the functional units can be done
speculatively.

Figure 1 is a block diagram of the hardware of a typical processor organization that would execute the present
method.

Figure 2 is a flow chart showing the steps of the present method.

Figures 3a through 3i are a set of block diagrams showing the format structures of the FORK, UNCOND_SUSPEND,
SUSPEND, SKIP, FSKIP, SKPMG, FORK_SUSPEND, FORK_S_SUSPEND, and FORK_M_SUSPEND instructions.

Figures 4a through 4d are a set of block diagrams showing a preferred embodiment of the encoding of the format
structures of the FORK, UNCOND_SUSPEND, SUSPEND, and FORK_SUSPEND instructions.

Figure 5 illustrates the use of some of the instructions proposed in this invention, in samples of assembly code.

Figure 6 also illustrates the use of some of the instructions proposed in this invention, in samples of assembly code.

This invention proposes FORK-SUSPEND instructions to enhance a traditional single-thread, speculative
superscalar CPU to simultaneously fetch, decode, speculate, and execute instructions from multiple program locations,
thus pursuing multiple threads of control.

Figure 1 is a block diagram of the hardware of a typical processor organization that would execute the method of
execution proposed in this invention. The method of execution is described later. The detailed description of Figure 1
follows.

Block 100 is a memory unit of the central processing unit (CPU) of the processor which holds program data and
instructions intended for execution on the processor. This memory unit is interfaced with the cache units, such that the
frequently used instruction and data portions of the memory unit are typically kept in an instruction cache unit (Block
110) and a data cache unit (Block 170), respectively. Alternatively, the instruction and data caches can be merged into
a single unified cache. Access time for the cache unit is typically much smaller than that of the memory unit. Memory
and cache units such as these are well known in the art. For example, the cache unit can be replaced by using main
memory and its ports for the cache memory and its ports. Cache can also be comprised of multiple caches or caches
with one or more levels, as is well known.

Block 110 is an instruction cache unit of the processor (CPU) which holds program instructions which are intended
for execution on the processor. These include the new instructions proposed in this invention, such as FORK, SKIP,
SUSPEND, UNCOND_SUSPEND (Block 112). The detailed semantics of these and other new instructions are
described later.

Block 115 containing the multiple ports P1, P2, . . . PN (Blocks 115-1, 115-2, . . . 115-N), of the instruction cache
is new to the current art. The multiple ports enable simultaneous porting of instructions to the instruction threads being
executed in parallel. Alternatively, one could port multiple instructions to a certain thread using a single wide port and
while that thread is busy executing the ported instructions, the same port could be used for porting multiple instructions
to another thread.

Block 120 is a bank of program counters, PC1, PC2, . . . PCN (Blocks 120-1, 120-2, . . . 120-N). These counters
can be any counter that is well known in the art. Each program counter tracks the execution of a certain thread. All of
the commercial CPUs designed to this date have only had to control the execution of a single instruction thread, for a
given program. Hence, the current and previous art has been limited to a single program counter, and the bank of
multiple program counters is thus a novel aspect of this invention. Each program counter is capable of addressing one
or more consecutive instructions in the instruction cache. In the preferred embodiment depicted in the block diagram of
Figure 1, each program counter is associated with an instruction cache port. Alternatively, different program counters
can share an instruction cache port.

Furthermore, in our preferred embodiment a specific program counter is associated with the main thread, and the
remaining program counters track the execution of the future threads. In Figure 1, PC1 (Block 120-1) is the main thread
program counter. The remaining program counters are referred to as the future thread program counters (Blocks 120-2,
. . . 120-N).

Block 130 refers to a novel thread management (TM) unit which is responsible for executing the new instructions
which can fork a new thread, and handling inter-thread communication via the merge process (described later).

This unit is also capable of discarding some or all instructions of one or more future threads. This unit is further
capable of determining whether one or more instructions executed by any of the future threads need to be discarded
due to violations of program dependencies, as a consequence of one or more speculations. If a speculation is made at
run time, it is communicated to the TM unit by the speculating unit. For example, any speculation of branch instruction
outcome in the dispatcher block (Block 140 described later) needs to be communicated to the TM unit. If any
speculation is made at compile time and encoded in an instruction, it is also communicated to the TM unit by the
dispatcher in Block 140 that decodes such an instruction. The resulting ability to execute multiple threads speculatively
is a unique feature of this invention.

Also note that the parallel fetch and execution of main and future threads implies that the proposed machine can
fetch and execute instructions out of their sequential trace order. This unique characteristic of this machine
distinguishes it from the prior art machines, which are unable to fetch instructions out of their sequential trace order
due to a single program counter.

Block 140 refers to a bank of dispatchers, Dispatcher-1, Dispatcher-2, . . . Dispatcher-N (Blocks 140-1, 140-2, . . .
140-N), where each dispatcher is associated with a specific program counter and thus capable of receiving instructions
from one of the instruction cache ports in an instruction buffer associated with the dispatcher (Blocks 141-1, 141-2,
. . . 141-N). A dispatcher is also capable of decoding and analyzing dependencies among the instructions in its buffer.
The dispatcher is further responsible for implementing the semantics of the SKIP, FSKIP, or SKPMG instructions
described later.

The instructions encountered by a dispatcher, which can fork or suspend a thread, are forwarded to the thread
management unit (Block 130). The TM unit is responsible for activating any future thread dispatcher by loading the
appropriate starting instruction in the corresponding program counter. The TM unit also suspends a future thread
dispatcher on encountering an UNCOND_SUSPEND instruction.
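The activation and suspension duties just described can be modeled in software as a small state table over the program counter bank. This is a sketch under stated assumptions, not the disclosed hardware: the bank size `NUM_PCS`, the three-state `pc_state` enum, all function names, and the policy of ignoring a fork when no future program counter is idle are hypothetical choices made here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PCS 3   /* assumed bank size: one main and two future program counters */

typedef enum { PC_IDLE, PC_RUNNING, PC_SUSPENDED } pc_state;

typedef struct {
    unsigned long pc[NUM_PCS];   /* pc[0] plays the role of the main thread PC */
    pc_state state[NUM_PCS];
} tm_unit;

/* Reset the bank: only the main thread runs, starting at entry. */
void tm_init(tm_unit *tm, unsigned long entry) {
    assert(tm != NULL);
    for (size_t i = 0; i < NUM_PCS; i++) {
        tm->pc[i] = 0;
        tm->state[i] = PC_IDLE;
    }
    tm->pc[0] = entry;
    tm->state[0] = PC_RUNNING;
}

/* FORK: activate a future thread dispatcher by loading the start address
   into an idle future program counter. Returns the slot used, or -1 when
   no future program counter is idle (assumed policy: the fork is ignored). */
int tm_fork(tm_unit *tm, unsigned long start) {
    assert(tm != NULL);
    for (int i = 1; i < NUM_PCS; i++) {
        if (tm->state[i] == PC_IDLE) {
            tm->pc[i] = start;
            tm->state[i] = PC_RUNNING;
            return i;
        }
    }
    return -1;
}

/* UNCOND_SUSPEND: stop a running future thread; its slot stays reserved
   until a later merge. Returns 0 on success, -1 for an invalid slot. */
int tm_uncond_suspend(tm_unit *tm, int slot) {
    assert(tm != NULL);
    if (slot < 1 || slot >= NUM_PCS || tm->state[slot] != PC_RUNNING)
        return -1;
    tm->state[slot] = PC_SUSPENDED;
    return 0;
}

/* Usage: two forks fill the bank, a third is refused, and a suspended
   thread still occupies its slot. Returns 1 if all of that holds. */
int tm_demo(void) {
    tm_unit tm;
    tm_init(&tm, 0x1000);
    int a = tm_fork(&tm, 0x2000);
    int b = tm_fork(&tm, 0x3000);
    int c = tm_fork(&tm, 0x4000);
    int s = tm_uncond_suspend(&tm, a);
    int d = tm_fork(&tm, 0x5000);
    return a == 1 && b == 2 && c == -1 && s == 0 && d == -1 && tm.pc[1] == 0x2000;
}
```

Keeping a distinct suspended state, rather than marking the slot idle, reflects that a suspended future thread still awaits its merger with the forking thread.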

The implementation techniques of run-time dependence analysis for out-of-order execution are well known in prior
art. The dispatcher associated with the main program counter, and hence with the main thread, is referred to as the
main thread dispatcher. In Figure 1, Dispatcher-1 (Block 140-1) is the main thread dispatcher. The remaining
dispatchers (Blocks 140-2, . . . 140-N) are associated with the future program counters and future threads, and are
referred to as the future thread dispatchers.

A novel aspect of the bank of dispatchers proposed in this invention is that the run-time dependence analysis of the
instructions in one dispatcher's buffer can be carried out independently of (and hence in parallel with) that of any other
dispatcher. This is made possible by the compile-time dependence analysis which can guarantee the independence of
the instruction threads under specified conditions. Thus, on the one hand, the run-time dependence analysis benefits
from the potentially much larger scope of the compile-time analysis (large scope refers to the ability to examine a large
number of instructions simultaneously for their mutual dependence). On the other hand, the compile-time analysis
benefits from the fork-suspend mechanism, which allows explicit identification of independent threads with speculation
on run-time outcomes. The dependence analysis techniques for run-time or compile-time are well known in the prior art;
however, the explicit speculative communication of the compile-time dependence analysis to the run-time dependence
analysis hardware is the novelty of this invention.

Block 150 is a scheduler that receives instructions from all the dispatchers in the bank of dispatchers (Block 140),
and schedules each instruction for execution on one of the functional units (Block 180). All the instructions received in
the same cycle from one or more dispatchers are assumed independent of each other. Such a scheduler is also well
known in prior art for superscalar machines. In an alternative embodiment, the scheduler could also be split into a set
of schedulers, each controlling a defined subset of the functional units (Block 180).

Block 160 is a register file which contains a set of registers. This set is further broken down into an architecturally
visible set of registers and architecturally invisible registers. Architecturally visible, or architected, registers refer to the
fixed set of registers that are accessible to the assembly level programmer (or the compiler) of the machine. The
architecturally visible subset of the register file would typically be common to all the threads (main and future threads).
Architecturally invisible registers include various physical registers of the CPU, a subset of which are mapped to the
architected registers, i.e., contain the values associated with the architected registers. The register file provides
operands to the functional units for executing many of the instructions and also receives results of execution. Such a
register file is well known in prior art.

As part of its implementation of the merge process (described later), the TM unit (Block 130) also communicates
with the register file, to ensure that every architected register is associated with the proper non-architected physical
register after the merge.

Block 170 is a data cache unit of the processor which holds some of the data values used as source operands by
the instructions and some of the data values generated by the executed instructions. Since multiple memory-resident
data values may be simultaneously required by the multiple functional units and multiple memory-bound results may be
simultaneously generated, the data cache would typically be multi-ported. Multi-ported data caches are well known in
prior art.

Block 180 is a bank of functional units (Functional Unit-1, Functional Unit-2, . . . Functional Unit-K), where each unit
is capable of executing some or all types of instructions. The functional units receive input source operands from and
write the output results to the register file (Block 160) or the data cache (Block 170). In the preferred embodiment
illustrated in Figure 1, all the functional units are identical and hence capable of executing any instruction.
Alternatively, the multiple functional units in the bank may be asymmetric, where a specific unit is capable of executing
only a certain subset of instructions. The scheduler (Block 150) needs to be aware of this asymmetry and schedule the
instructions appropriately. Such trade-offs are common in prior art also.
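One simple way a scheduler could account for asymmetric functional units is a capability mask per unit. The following is a minimal sketch of that idea only; the instruction classes, the `functional_unit` record and the first-fit policy are assumptions introduced here and are not described in the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes, one bit each. */
enum {
    CLASS_INT    = 1 << 0,
    CLASS_FP     = 1 << 1,
    CLASS_MEM    = 1 << 2,
    CLASS_BRANCH = 1 << 3
};

typedef struct {
    unsigned classes;   /* bit set of classes this unit can execute */
    int busy;           /* nonzero once the unit has work this cycle */
} functional_unit;

/* First-fit choice: give the instruction to the first free unit whose
   capability mask includes its class, and mark that unit busy.
   Returns the unit index, or -1 if the instruction must wait. */
int schedule_on_unit(functional_unit *units, size_t n, unsigned instr_class) {
    assert(units != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!units[i].busy && (units[i].classes & instr_class)) {
            units[i].busy = 1;
            return (int)i;
        }
    }
    return -1;
}
```

With identical units, as in the preferred embodiment, every mask would have all bits set and the function reduces to picking the first free unit.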

Block 190 is an instruction completion unit which is responsible for completing instruction execution in an order
considered a valid order by the architecture. Even though a CPU may execute instructions out-of-order, it may or may
not be allowed to complete them in the same order, depending on the architectural constraints. Instructions scheduled
for execution by future thread dispatchers become candidates for completion by the completion unit only after the TM
unit (Block 130) ascertains the validity of the future thread in case of a speculative thread.

This invention proposes several new instructions which can be inserted in the instruction sequence at compile time. 
The details of the semantics of these instructions follow. 

1. FORK

This instruction identifies the beginning address(es) of one or more threads of instructions. Each identified thread
of instructions is referred to as a future thread. These future threads can be executed concurrently with the forking
thread, which continues to execute the sequence of instructions sequentially following the FORK. The starting CPU
state for the future thread is a copy of the CPU state at the point of encountering the FORK instruction.

2. UNCOND_SUSPEND

On encountering this instruction, a future thread must unconditionally suspend itself, and await its merger with the
forking thread. This may be needed, for example, in cases where the instructions following the unconditional
suspend instruction have essential data dependency with some instructions on a different thread. Since this proposed
instruction does not require any other attribute, it could also be merged with the SUSPEND instruction (described
later). In other words, one of the encodings of the SUSPEND instruction could simply specify an unconditional
suspend.

3. SUSPEND

On encountering this instruction, a future thread can continue to proceed with its instruction fetch and execution,
but the results of the sequence of instructions between a first SUSPEND instruction and a second SUSPEND
instruction or an UNCOND_SUSPEND instruction in the sequential trace order of the program are discarded if the
compile-time specified condition associated with the first SUSPEND instruction evaluates to false at run time. To
simplify the discussions to follow, we define the term dependence region of a SUSPEND instruction as the
sequence of instructions in the sequential trace order that starts with the first instruction after the SUSPEND
instruction and is terminated on encountering any other SUSPEND instruction or on encountering an
UNCOND_SUSPEND instruction.

4. SKIP

Upon encountering this instruction, a future thread may just decode the next compile-time specified number of
instructions (typically spill loads), and assume execution of these instructions by marking the corresponding source
and destination registers as valid, but the thread need not actually perform the operations associated with the
instructions. The main thread treats this instruction as a NOP.

5. FORK_SUSPEND

The op-code of this instruction is associated with an address identifying the start of a future thread, and a sequence
of numbers (N1, N2, . . . , Nn), each with or without conditions. The given sequence of n numbers refers to the n
consecutive groups of instructions starting at the address associated with the FORK instruction. A number without
any associated condition implies that the corresponding group of instructions can be unconditionally executed as
a future thread. A number with an associated condition implies that the future thread execution of the corresponding
group of instructions would be valid only if the compile-time specified condition evaluates to true at run time.

6. FORK_S_SUSPEND

The op-code of this instruction is associated with an address identifying the start of a future thread, a number s,
and a sequence of numbers (N1, N2, . . . , Nn), each with or without conditions. The given sequence of n numbers
refers to the n consecutive groups of instructions starting at the address associated with the FORK instruction. A
number without any associated condition implies that the corresponding group of instructions can be
unconditionally executed as a future thread. A number with an associated condition implies that the future thread
execution of the corresponding group of instructions would be valid only if the compile-time specified condition
evaluates to true at run time. The associated number s refers to the s instructions, at the start of the thread, which may
just be decoded to mark the corresponding source and destination registers as valid, but the thread need not actually
perform the operations associated with the instructions.

7. FORKJUSUSPEND 

The op-code of this instruction is associated with an address identifying the start of a future thread, and a set of masks
(M1, M2, . . . , Mn), each with or without conditions. A mask without any associated condition represents the set of
architected registers which unconditionally hold valid source operands for the future thread execution. A mask
associated with a condition refers to the set of architected registers which can be assumed to hold valid source
operands for the future thread execution, only if the compile-time specified condition evaluates to true at run time.


8. FSKIP 

The op-code of this instruction is associated with a mask, and a number s. Upon encountering this instruction, a
future thread may skip the fetch, decode, and execution of the next s instructions. The future thread further uses
the mask to mark the defined set of architected registers as holding valid operands. The main thread treats this
instruction as a NOP.

9. SKPMG 

Upon encountering this instruction, a future thread may just decode the next compile-time specified number of
instructions (typically spill loads), to mark the corresponding source and destination registers as valid, but the
thread need not actually perform the operations associated with the instructions. If this instruction is encountered
by the main thread, a check is made to determine if a future thread was previously forked to start at the address
of this SKPMG instruction. If so, the main thread is merged with the corresponding future thread by properly merging
the machine states of the two threads, and the main thread resumes the execution at the instruction following
the instruction where the future thread was suspended. If there was no previous fork to this address, the main
thread continues to execute the sequence of instructions following this instruction. The importance of such an
instruction is explained later.

Detailed Description of Formats of the New Instructions: 

A detailed description of Figures 3a through 3i, illustrating the formats of the new instructions follows. 



1. FORK <addr_1>, <addr_2>, . . . , <addr_n>

The FORK instruction (Block 10) in Figure 3a includes an op-code field (Block 11), and one or more address fields,
addr_1, addr_2, . . . , addr_n (Blocks 12-1, 12-2, . . . , 12-n), each identifying the starting instruction address of
a future thread.

2. UNCOND_SUSPEND 

The UNCOND_SUSPEND instruction (Block 20) in Figure 3b contains an op-code field.


3. SUSPEND <mode>, <cond_1> <cond_2> . . . <cond_n>

The SUSPEND instruction (Block 30) in Figure 3c includes a SUSPEND op-code field (Block 31), a mode field
(Block 32), and a condition field (Block 33). A preferred embodiment of the invention can use the condition field to
encode compile-time speculation on the outcome of a sequence of one or more branches as cond_1, cond_2, . . . ,
cond_n (Blocks 33-1, 33-2, . . . , 33-n). The semantics of this specific condition-field encoding is explained in
more detail below.

The mode field is used for interpreting the set of conditions in the condition field in one of two ways. If the mode
field is set to valid (V), the thread management unit discards the results of the set of instructions in the dependence
region associated with the SUSPEND instruction, if any one of the compile-time specified conditions, among
<cond_1> through <cond_n>, associated with the SUSPEND instruction, evaluates to false at run time. If the mode
field is set to invalid (I), the thread management unit discards the results of the set of instructions in the dependence
region associated with the SUSPEND instruction, if all of the compile-time specified conditions, from <cond_1>
through <cond_n>, associated with the SUSPEND instruction, evaluate to true at run time.

Intuitively speaking, a compiler would use the valid mode setting for encoding a good path from the fork point
to the merge point, whereas it would use the invalid mode setting for encoding a bad path from the fork point to the
merge point.

The first condition in the sequence, cond_1, is associated with the first unique conditional branch encountered by
the forking thread at run time, after forking the future thread containing the SUSPEND instruction; the second
condition in the sequence, cond_2, is associated with the second unique conditional branch encountered by the forking
thread at run time, after forking the future thread containing the SUSPEND instruction, and so on. Only the
branches residing at different instruction locations are considered unique. Furthermore, the conditions which
encode the compile-time speculation of a specific branch outcome, in a preferred embodiment, can be any one
of the following three: taken (T), not-taken (N), or don't care (X). Alternately, the speculation associated with the
conditions can be restricted to be either of the following two: taken (T), or not-taken (N).

To further clarify the condition encoding format, consider some example encodings:

o SUSPEND V, TXN

This encoding implies that the instructions following this conditional suspend instruction are valid only if the
speculation holds. In other words, the results of the set of instructions in the dependence region associated with
the SUSPEND instruction are valid only if all of the compile-time specified conditions, from <cond_1> through
<cond_n>, associated with the SUSPEND instruction evaluate to true at run time. The first control flow condition
assumes that the first unique conditional branch encountered by the forking thread at run time, after forking the
thread containing the SUSPEND instruction, is taken. The second such branch is allowed by the compiler to go either way
(in other words, a control independent branch), and the third such branch is assumed by the compiler to be not
taken.

o SUSPEND I, NTXNTXT 

This encoding implies that the instructions following this conditional suspend instruction are invalid only if the
speculation holds. In other words, results of the set of instructions in the dependence region associated with
the SUSPEND instruction are discarded only if all of the compile-time specified conditions, from <cond_1>
through <cond_n>, associated with the SUSPEND instruction evaluate to true at run time. The first control flow
condition assumes that the first unique conditional branch encountered by the forking thread at run time, after
forking the thread containing the SUSPEND instruction, is not taken. The second such branch is assumed by
the compiler to be taken, the third such branch is allowed by the compiler to go either way (in other words, a
control independent branch), the fourth such branch is assumed by the compiler to be not taken, the fifth such
branch is assumed to be taken, the sixth such branch is allowed to go either way, and the seventh such
branch is assumed to be taken.

Note that if the forking thread code in the region after the fork and before the merge is restricted to be loop-free, the
dynamic sequence of branches encountered in the forking thread after the fork would be all unique. In other words,
under these circumstances, the first unique conditional branch would simply be the first dynamically encountered
conditional branch, the second unique conditional branch would simply be the second dynamically encountered conditional
branch, and so on.
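
The condition encoding described above can be illustrated with a minimal Python sketch. It is not part of the disclosed hardware. The function names, the representation of the branch trace as (address, taken) pairs, and the choice to use the first recorded outcome of a repeated branch are all assumptions made for illustration. The sketch also assumes the trace contains at least as many unique branches as there are conditions.

```python
def unique_branch_outcomes(trace):
    """trace: (branch_address, taken) pairs in the order the forking thread
    encountered them. Only branches at different addresses count as unique.
    Assumption: the first recorded outcome of each address is the one used."""
    seen = set()
    outcomes = []
    for addr, taken in trace:
        if addr not in seen:
            seen.add(addr)
            outcomes.append(taken)
    return outcomes


def speculation_holds(conds, outcomes):
    """conds: string over 'T' (taken), 'N' (not-taken), 'X' (don't care)."""
    return all(c == "X" or (c == "T") == taken
               for c, taken in zip(conds, outcomes))


def results_discarded(mode, conds, trace):
    """mode 'V': discard if any condition is false.
    mode 'I': discard if all conditions are true."""
    holds = speculation_holds(conds, unique_branch_outcomes(trace))
    return (not holds) if mode == "V" else holds


# The branch at 0x10 is encountered twice, so only three unique branches are seen.
trace = [(0x10, True), (0x20, False), (0x10, False), (0x30, False)]
print(results_discarded("V", "TXN", trace))  # speculation holds, results kept
print(results_discarded("I", "TXN", trace))  # same path encoded as a bad path
```

With the valid mode, the path taken/either/not-taken keeps the results; encoding the same path with the invalid mode discards them.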

The condition format explained above is also used in specifying compile-time speculation conditions in case of
FORK_SUSPEND, FORK_S_SUSPEND, and FORK_M_SUSPEND instructions. The preferred embodiment assumes
a valid mode field setting in the condition field encodings used in FORK_SUSPEND, FORK_S_SUSPEND, and
FORK_M_SUSPEND instructions, implying that the thread management unit discards the results of the set of instructions
in the dependence region associated with the SUSPEND instruction, if any one of the compile-time specified
conditions, among <cond_1> through <cond_n>, associated with the SUSPEND instruction evaluates to false at run time.

4. FORK_SUSPEND <addr>, <N1,cond_1> . . . <Nn,cond_n>

The FORK_SUSPEND instruction (Block 40) in Figure 3d includes an op-code field (Block 41), an address field
(Block 42), and one or more condition fields (Blocks 43-1, 43-2, . . . , 43-n), each associated with a count field, and
one or more conditions. The preferred format for the conditions is the same as that explained above in the context of
the SUSPEND instruction, assuming valid mode field.

5. SKIP <n> 

The SKIP instruction (Block 50) in Figure 3e includes an op-code field (Block 51) and a count field (Block 52),
specifying the number of instructions after this instruction whose execution can be skipped, as explained above in the
context of SKIP instruction.

6. FORK_S_SUSPEND <addr>, <N>, <N1,cond_1> . . . <Nn,cond_n>

The FORK_S_SUSPEND instruction (Block 60) in Figure 3f includes an op-code field (Block 61), an address field
(Block 62), a count field (Block 63) specifying the number of instructions, at the start of the thread, which can be
skipped in the sense explained above (in the context of SKIP instruction), and one or more condition fields (Blocks
64-1, 64-2, . . . , 64-n), each associated with a count field, and one or more conditions. The preferred format for the
conditions is the same as that explained above in the context of the SUSPEND instruction, assuming valid mode field.

7. FORK_M_SUSPEND <addr>, <M1,cond_1> . . . <Mn,cond_n>

The FORK_M_SUSPEND instruction (Block 70) in Figure 3g includes an op-code field (Block 71), an address field
(Block 72), and one or more condition fields (Blocks 73-1, 73-2, . . . , 73-n), each associated with a mask field, and
one or more conditions. Each mask field contains a register mask specifying the set of architected registers that
hold valid source operands, provided the associated conditions hold at run time. The preferred format for the
conditions is the same as that explained above in the context of the SUSPEND instruction, assuming valid mode field.

8. FSKIP <mask> <n>

The FSKIP instruction (Block 80) in Figure 3h includes an op-code field (Block 81), a mask field (Block 82)
defining a set of registers, and a count field (Block 83), specifying the number of instructions that can be completely
skipped, as explained above in the context of FSKIP instruction.

9. SKPMG <n>

The SKPMG instruction (Block 90) in Figure 3i includes an op-code field (Block 91) and a count field (Block 92),
specifying the number of instructions after this instruction whose execution can be skipped, as explained above in the
context of SKPMG instruction.

THE MERGE ACTION: MERGING OF THE FORKED THREAD WITH A FORKING THREAD: 

The forked (future) thread is merged with the corresponding forking thread (e.g., the main thread) when the forking
thread reaches the start of the forked future thread. Merging is accomplished by merging the CPU states of the two
threads such that the CPU states defined by the forked thread supersede, while the rest of the states are retained from
the forking thread. CPU state of a thread would typically include the architecturally visible registers used and defined by
the thread. The forking thread program counter is updated to continue execution such that the instructions properly
executed by the merged forked thread are not re-executed and any instruction not executed by the merged forked thread
is appropriately executed. Properly executed instructions refer to those instructions that do not violate any essential
program dependencies. The forking thread continues the execution past the latest execution point of the merged thread,
and the instructions properly executed by the merged future thread become candidates for completion at the end of
the merge process. The resources associated with the merged future thread are released at the end of the merge
process. Note that at the time of merge, the forked future thread is either already suspended, or still actively executing. In
either case, at the end of the merge process, the merged future thread effectively ceases to exist. Also note that in the
absence of an explicit suspend primitive, such as UNCOND_SUSPEND, a forked future thread would always continue
to execute until the merge.
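
As an illustration only, the merge of register state can be sketched in Python as follows. The dictionary representation of the CPU state, the function name merge_threads, and the resume_pc parameter are assumptions and do not appear in the specification. The check for properly executed instructions is not modelled here.

```python
def merge_threads(forking_regs, forked_defined_regs, resume_pc):
    """Merge a forked (future) thread into its forking thread.

    forking_regs:        register name -> value, as seen by the forking thread
    forked_defined_regs: register name -> value, only the registers the
                         forked thread defined (assumed already validated)
    resume_pc:           address just past the latest execution point of the
                         forked thread
    Returns the merged register state and the updated program counter.
    """
    merged = dict(forking_regs)          # states retained from the forking thread
    merged.update(forked_defined_regs)   # states defined by the forked thread supersede
    return merged, resume_pc


regs, pc = merge_threads({"r1": 1, "r2": 2}, {"r2": 20, "r3": 30}, resume_pc=42)
print(regs, pc)
```

Register r1 is retained from the forking thread, while r2 and r3 take the values defined by the forked thread, and execution continues at the returned program counter.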

OPTIONAL NATURE OF FORKS:

A novel characteristic of the instructions proposed in this invention is that their use at compile time does not require
any assumption regarding the run-time CPU resources. Depending on the aggressiveness of an actual implementation,
a specific CPU may or may not be able to actually fork a future thread. In other words, from the CPU's point of view, an
actual fork at run time in response to encountering any FORK instruction is entirely optional. The user of these
instructions (e.g., the compiler) does not need to keep track of the number of pending future threads, and it also cannot
assume any specific fork to be definitely obeyed (i.e., fork a future thread) at run time.

The compiler identifies control and data independent code regions which may be executed as separate (future)
threads. However, the compiler does not perform any further restructuring or optimizations which assume that these
threads will execute in parallel. For example, the compiler preserves any spill code that would be needed to guarantee
correct program execution when any one of the inserted FORK instructions is ignored by the CPU at run time. Spill code
refers to the set of instructions which are inserted at compile time to store the contents of any architecturally visible
CPU register in a certain location in the instruction cache, and to later reload the contents of the same location without
another intervening store. Note that the execution of spill code may be redundant during its execution as a future thread.
To optimize the handling of such spill code during future thread execution, the invention adds the SKIP instruction and
its variants, such as FSKIP and SKPMG, which enable compile-time hints for reducing or eliminating the redundant spill
code execution. The detailed semantics of these new instructions are described above.

Note that as a direct consequence of the optional nature of FORK instructions, there is no need for re-compilation
for different implementations of this enhanced machine architecture, each capable of forking zero or more threads.
Similarly, there is no need to recompile any old binary which does not contain any of the new instructions.

INTERPRETING MULTIPLE CONDITIONAL SUSPENDS IN A FUTURE THREAD: 

It is possible that a future thread which gets forked in response to a FORK instruction encounters a series of
conditional suspends before encountering an unconditional suspend. Each conditional suspend is still interpreted in
association with the common fork point and independent of other conditional suspends. Thus, it is possible to associate
different control flow speculations with different portions of a future thread. Consider a SUSPEND instruction A.
Suppose A is followed by another SUSPEND instruction B, after a few instructions other than FORK, SUSPEND,
UNCOND_SUSPEND, FORK_S_SUSPEND, FORK_M_SUSPEND, or SKPMG instructions. SUSPEND instruction B
would typically be followed later by an UNCOND_SUSPEND instruction. Assume that the compile-time condition
associated with the SUSPEND instruction A is determined to be false at run time. To simplify the compilation and to reduce
the state keeping in future threads, a preferred embodiment of this invention can simply discard the results of all
instructions between A and the UNCOND_SUSPEND instruction, instead of limiting the discarding to between A and B.
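
The simplified discarding policy of the preferred embodiment can be sketched in Python as below. This is an illustrative model only: the list-of-opcode-names representation, the function name kept_results, and the cond_holds mapping are assumptions, and valid mode is assumed for every SUSPEND.

```python
def kept_results(thread, cond_holds):
    """Return the indices of future-thread instructions whose results are kept.

    thread:     opcode names in program order, ending at UNCOND_SUSPEND
    cond_holds: index of each SUSPEND -> True if its compile-time condition
                evaluated to true at run time (valid mode assumed)
    Once any SUSPEND fails, everything from it up to the UNCOND_SUSPEND is
    discarded, not only the instructions up to the next SUSPEND.
    """
    kept = []
    for i, op in enumerate(thread):
        if op == "UNCOND_SUSPEND":
            break
        if op == "SUSPEND":
            if not cond_holds[i]:
                break                 # discard from A through UNCOND_SUSPEND
            continue
        kept.append(i)
    return kept


thread = ["add", "SUSPEND", "mul", "SUSPEND", "sub", "UNCOND_SUSPEND"]
print(kept_results(thread, {1: False, 3: True}))   # A fails: region after B is dropped too
print(kept_results(thread, {1: True, 3: False}))   # only B fails
```

In the first call the condition of A is false, so the results after B are dropped even though the condition of B holds.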

SIMPLIFIED IDENTIFICATION OF MERGE-POINTS:

It may be possible at compile time to group all the spill loads in the future thread and move the group to the top of
the block, where future thread execution will begin. If the compiler further ensures that the first instruction of every
potential future thread is the new SKPMG instruction, then this instruction serves both as an indicator of the spill loads
that can be skipped, and as a marker for the start of the future thread. The semantics of this instruction has been
described above. Note that in the absence of such a future thread marker (in the form of SKPMG), the main thread may
constantly need to check its instruction address against all previously forked future threads to detect if a merge is
needed. Also note that even if the number of instructions being skipped is zero, the compiler must still insert this
SKPMG instruction, as it serves the additional functionality of a future thread marker in this interpretation.
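
The saving in merge-checks can be illustrated with a small Python sketch. The list-of-opcode-names program representation and the function name merge_check_addresses are assumptions made for this example only.

```python
def merge_check_addresses(program, use_skpmg_marker):
    """Return the main-thread fetch addresses at which a merge-check is made.

    Without the marker, every fetch address must be compared against the
    start addresses of all pending future threads. With the marker, a check
    is needed only when the main thread encounters a SKPMG instruction.
    """
    return [addr for addr, op in enumerate(program)
            if not use_skpmg_marker or op == "SKPMG"]


program = ["FORK", "add", "SKPMG", "ld", "ld", "mul"]
print(merge_check_addresses(program, use_skpmg_marker=False))
print(merge_check_addresses(program, use_skpmg_marker=True))
```

For this six-instruction sequence the implicit scheme checks at all six addresses, whereas the marker-based scheme checks only at the single SKPMG.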

Figure 2 is a flow chart showing the steps of the present method of execution, referred to as the Primary Execution
Methodology (PEM). A detailed description of Figure 2, along with a description of the present method, follows.

1. Find fork points (Block 210):

Generate a static sequence of instructions using techniques known in the art, without any regard to the new
instructions proposed in this invention. Analyze this sequence of instructions to determine a set of fork points. A fork point
refers to the position in the static instruction sequence where the available machine state is capable of starting a
parallel execution of one or more sets of instructions which appear later (but not immediately after the fork point) in
the sequential trace order. The identification of fork points involves data and control dependence analysis, based
on some or all of the corresponding program dependence graph (combination of control dependence graph and
data dependence graph), using techniques known in the prior art. For example, the resolution of a branch
instruction can lead to a fork point for the threads of instructions that are essentially control dependent on the branch
instruction.

2. Insert FORKs (Block 220):

Insert zero or more FORK instructions at zero or more of the potential fork points, at compile time, where the FORK
instruction is capable of identifying the starting addresses of zero or more potential future threads associated with
the fork point. The association of a specific FORK instruction with its forked future thread(s), if any, is managed by
the TM unit described above.


3. Load static sequence (Block 230): 

Load the static sequence of instructions generated after the previous (Insert FORKs) step (Block 220) into the
memory system (Block 100 of Figure 1) starting at a fixed location, where the memory system is interfaced to the
instruction cache of the central processing apparatus, and subsequences of the static sequence are periodically
transferred to the instruction cache.

4. Fetch and merge-check (Block 240):

Fetch the instruction sequence from the instruction cache by addressing
the sequence through the main program counter (i.e., as a main thread) starting at a current address, and updating
the program counter. Instructions missing in the instruction cache are fetched from the main memory into the
cache. Along with the instruction fetch, a check is also made to determine if there are one or more unmerged future
threads starting at the current instruction fetch address. The TM unit (Block 130 of Figure 1) is also responsible for
carrying out this implicit merge-check. This check would normally involve comparing each instruction fetch
address against the starting addresses of all unmerged (pending) future threads.

5. Thread validity check (Block 250):

In case it is determined in the previous step (Block 240) that one or more future threads had been forked previously
at the instruction fetch address of another execution thread (e.g., the main thread), a further check is made by the
TM unit to ascertain if some or all of the instructions executed by each of these future threads need to be discarded
due to any violation of program dependencies, resulting from one or more speculations.


6. Merge (Block 260): 

Validly executed portions of the forked future threads identified in the previous (Thread validity check) step (Block 
250), are merged with the main thread via the merge operation described before. 

7. Decode (Block 270):

Decode the fetched instructions in the dispatcher. Check to see if one or more of the instructions are decoded as
a FORK instruction.

8. Execute main thread (Block 280):

For any instruction decoded as other than a FORK instruction in the previous (Decode) step (Block 270), continue
execution by analyzing the instruction dependencies (using Block 140 of Figure 1), and by scheduling them for
execution (using Block 150 of Figure 1) on appropriate functional units (Block 180 of Figure 1).

9. Complete (Block 290):

Complete instruction execution through the completion unit (Block 190 of Figure 1), as described above. The
process of fetch, decode, and execute, described in steps 4 through 9, continues.

10. Determine fork-ability (Block 300): 

If an instruction is decoded as a FORK instruction in the (Decode) step associated with Block 270 above, a check
is made to determine the availability of machine resources for forking an additional future thread. Machine
resources needed to fork a future thread include an available program counter and available internal buffer space for
saving thread state.

11. Fork (Block 310):

In case there are resources available, the TM unit forks future thread(s) by loading the address(es) associated with
the FORK instruction into future program counter(s). This starts off future thread(s) execution, where the starting
machine state (except the program counter) of a future thread is the same as that of the main thread (the thread
decoding the associated FORK instruction) at the fork point.

12. Execute future thread (Block 320):

A future thread execution proceeds, in parallel with the forking thread execution,
in a manner similar to steps (4) through (8) above, except using one of the future program counters and one
of the future thread dispatchers, instead of the main program counter and the main thread dispatcher, respectively,
and referring to the main thread as the forking thread instead.

13. Stop future thread (Block 330): 

A future thread execution is suspended and the associated resources are released, after the future thread is 
merged with the forking thread or after the future thread is discarded by the TM unit. 
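
The control flow of steps 4 through 13 can be illustrated with a highly simplified, sequential Python sketch. It is not the disclosed hardware: the tuple-based program representation, the function name run_pem, and the future_pcs parameter are assumptions. The sketch further assumes that a future thread runs up to an UNCOND_SUSPEND instruction (or the end of the program), that all of its results pass the validity check, and that the main thread treats UNCOND_SUSPEND as a no-op.

```python
def run_pem(program, future_pcs=1):
    """program: list of tuples such as ("FORK", target), ("UNCOND_SUSPEND",)
    or ("op",). Returns a log of events in the order they are modelled."""
    pending = {}    # future thread start address -> address to resume at after merge
    log = []
    pc = 0
    while pc < len(program):
        if pc in pending:                       # step 4: implicit merge-check
            resume = pending.pop(pc)            # steps 5, 6, 13: results assumed valid
            log.append(("merge", pc, resume))
            pc = resume
            continue
        op = program[pc]
        if op[0] == "FORK":                     # step 7: decode
            target = op[1]
            if len(pending) < future_pcs:       # step 10: determine fork-ability
                end = target                    # steps 11, 12: fork and run the future thread
                while end < len(program) and program[end][0] != "UNCOND_SUSPEND":
                    log.append(("future", end))
                    end += 1
                pending[target] = min(end + 1, len(program))
                log.append(("fork", target))
            else:
                log.append(("fork ignored", target))
        elif op[0] != "UNCOND_SUSPEND":         # steps 8, 9: execute and complete
            log.append(("main", pc))
        pc += 1
    return log


program = [("FORK", 3), ("a",), ("b",), ("c",), ("d",), ("UNCOND_SUSPEND",), ("e",)]
executed = lambda log: sorted(e[1] for e in log if e[0] in ("main", "future"))
print(executed(run_pem(program, future_pcs=1)))
print(executed(run_pem(program, future_pcs=0)))
```

Both runs execute the same set of instructions, whether the fork is honored (one future program counter) or ignored (none), which is the behavior the optional nature of forks requires.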


Some enhancements to the primary execution methodology (PEM) described above are described below.

Alternative Embodiment 1:

1. Step (2) in the PEM has the following additional substep:

o An UNCOND_SUSPEND instruction is inserted at the end of every future thread.

2. Step (12) in the PEM has the following additional substep:

o Upon encountering an UNCOND_SUSPEND instruction, during its corresponding future thread execution, a future
thread unconditionally suspends itself.

3. Step (8) in the PEM has the following additional substep:

o If an UNCOND_SUSPEND instruction is encountered for execution by a thread other than its corresponding
future thread (e.g., in the main thread), it is ignored.

Alternative Embodiment 2: 

1. Step (1) in the PEM with alternative embodiment 1 has the following additional substep:

o Corresponding to every UNCOND_SUSPEND instruction, zero or more SUSPEND instructions may be 
inserted in the corresponding future thread, where, each SUSPEND instruction is associated with a condition. 

2. Step (2) in the PEM with alternative embodiment 1 has the following additional substep:

o The set of instructions in the dependence region associated with a SUSPEND instruction are considered valid
for execution in the corresponding future thread only if the compile-time specified condition associated with the
SUSPEND instruction evaluates to true at run time. Therefore, a future thread can also be forced to suspend
(by the TM unit) at a conditional suspend point, if the associated speculation is known to be invalid by the time
the future thread execution encounters the conditional suspend instruction.

3. Step (3) in the PEM with alternative embodiment 1, has the following additional substep: 

o If a SUSPEND instruction is encountered for execution by a thread other than its corresponding future thread
(e.g., in the main thread), it is ignored.

Alternative Embodiment 3: 

1. Step (1) in the PEM with alternative embodiment 2 has the following additional substep:

o Zero or more SKIP instructions may be inserted in a future thread, where each SKIP instruction is associated
with a number, s.

2. Step (2) in the PEM with alternative embodiment 2 has the following additional substep:

o Upon encountering a SKIP instruction, with an associated number, s, during its corresponding future thread
execution, the next s instructions following this instruction may only need to be decoded, and the remaining
execution of these instructions can be skipped. The source and destination registers used in these instructions can
be marked as holding valid operands, but these s instructions need not be scheduled for execution on any of
the functional units.

3. Step (3) in the PEM with alternative embodiment 2, has the following additional substep: 

o If a SKIP instruction is encountered for execution by a thread other than its corresponding future thread (e.g., 
in the main thread), it is ignored. 

Alternative Embodiment 4: 

1. Step (1) in the PEM with alternative embodiment 2, has the following additional substep: 

o Zero or more FSKIP instructions may be inserted in a future thread, where each FSKIP instruction is associated
with a mask, defining a set of architected registers, and a number, s.

2. Step (2) in the PEM with alternative embodiment 2, has the following additional substep: 

o Upon encountering an FSKIP instruction, with a mask, and a number, s, during its corresponding future
thread execution, the next s instructions following this instruction can be skipped. In other words, these
instructions need not be fetched, decoded or executed. The registers identified in the mask can be marked as holding
valid operands.

3. Step (3) in the PEM with alternative embodiment 2, has the following additional substep: 

o If an FSKIP is encountered for execution by a thread other than its corresponding future thread (e.g., in the
main thread), it is ignored.
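
The difference between SKIP and FSKIP in a future thread can be illustrated with the following Python sketch. The (destination, sources) instruction representation, the function names, and the convention that bit r of the mask corresponds to register r are assumptions made for this example.

```python
def skip_marks(next_instrs, s):
    """SKIP s: the next s instructions are decoded but not executed.
    Their source and destination registers are marked as holding valid operands.

    next_instrs: list of (dest_register, source_registers) tuples
    """
    valid = set()
    for dest, srcs in next_instrs[:s]:
        valid.add(dest)
        valid.update(srcs)
    return valid


def fskip_marks(mask, num_regs=32):
    """FSKIP mask, s: the next s instructions are not even fetched, so the
    registers to mark valid come from the mask instead of from decoding.
    Assumption: bit r of the mask stands for architected register r."""
    return {r for r in range(num_regs) if (mask >> r) & 1}


spill_loads = [("r1", ("r2",)), ("r3", ("r1", "r4")), ("r5", ())]
print(sorted(skip_marks(spill_loads, 2)))
print(sorted(fskip_marks(0b1010)))
```

SKIP derives the register set by decoding the skipped instructions, whereas FSKIP must carry the set explicitly because the skipped instructions are never seen by the future thread.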

Alternative Embodiment 5:

1. Step (1) in the PEM with alternative embodiment 2, has the following additional substep: 

o A SKPMG instruction is inserted at the start of every future thread, where each SKPMG instruction is
associated with a number, s.

2. Step (2) in the PEM with alternative embodiment 2, has the following additional substep: 

o Upon encountering a SKPMG instruction, with an associated number, s, during its corresponding future thread 
execution, the next s instructions following this instruction, may only need to be decoded, and the remaining 
execution of these instructions can be skipped. The source and destination registers used in these instructions 
can be marked as holding valid operands, but, these s instructions need not be scheduled for execution on any 
of the functional units. 

3. Step (3) in the PEM with alternative embodiment 2, has the following additional substep: 

o If a SKPMG is encountered for execution by a thread other than its corresponding future thread (e.g., in the
main thread), a merge-check is made to determine if a future thread has been forked in the past starting at the
instruction address of the SKPMG instruction.

4. The implicit merge-check in Step (4) of the PEM is now unnecessary and hence dropped.

Alternative Embodiment 6:

1. The Insert FORKs step (i.e., Step-3) in the PEM is replaced by the following step:

o Insert zero or more FORK_SUSPEND instructions at zero or more of the potential fork points, where the
FORK_SUSPEND instruction contains an address identifying the starting address of an associated potential
future thread, and a sequence of numbers each with or without a condition, where the given sequence of
numbers refers to the consecutive groups of instructions, starting at the address associated with the
FORK_SUSPEND instruction. The association of a specific FORK_SUSPEND instruction with its forked future
thread, if any, is managed by the TM unit described above.

2. The Determine fork-ability step (i.e., Step-10) in the PEM is replaced by the following step:

o For an instruction decoded as a FORK_SUSPEND instruction, checking to determine the availability of
machine resources for forking an additional future thread.

3. The Fork step (i.e., Step-11) in the PEM is replaced by the following step:

o Forking a future thread, if there are resources available, by loading the address(es) associated with the
FORK_SUSPEND instruction into future program counter(s).

4. The Execute future thread step (i.e., Step-12) in the PEM has the following additional substep:

o The number sequence associated with the FORK_SUSPEND instruction controls the execution of the
corresponding future thread in the following manner. A number, say, n without any associated condition, implies that
the corresponding group of n instructions can be unconditionally executed as a future thread, and a number,
say, m with an associated condition, implies that the future thread execution of the corresponding group of m
instructions would be valid only if the compile-time specified condition evaluates to true at run time.
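
The way the number sequence partitions a future thread into consecutive groups can be illustrated with the Python sketch below. The function names, the use of instruction-granular addresses, the holds callback, and the independent treatment of each group are assumptions. The specification does not state whether a failed condition also affects later groups.

```python
def future_thread_groups(start, groups):
    """groups: list of (count, condition) pairs, condition None if absent.
    Returns (address_range, condition) for each consecutive group, with
    addresses counted in instructions from the fork target address."""
    out = []
    addr = start
    for count, cond in groups:
        out.append((range(addr, addr + count), cond))
        addr += count
    return out


def valid_addresses(start, groups, holds):
    """holds: callable mapping a condition to its run-time truth value.
    Assumption: each group is judged on its own condition only."""
    valid = []
    for addrs, cond in future_thread_groups(start, groups):
        if cond is None or holds(cond):
            valid.extend(addrs)
    return valid


groups = [(3, None), (2, "TN"), (1, None)]
print(valid_addresses(100, groups, holds=lambda cond: False))
print(valid_addresses(100, groups, holds=lambda cond: True))
```

When the condition on the second group fails, only the unconditional groups at addresses 100 to 102 and 105 remain valid under this interpretation.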


Alternative Embodiment 7: 

1. The Insert FORKs step (i.e., Step-3) in the PEM is replaced by the following step: 

o Insert zero or more FORK_S_SUSPEND instructions at zero or more of the potential fork points, where a
FORK_S_SUSPEND instruction contains an address identifying the starting address of an associated potential
future thread, a number, say, s, and a sequence of numbers each with or without a condition, where the
given sequence of numbers refers to the consecutive groups of instructions, starting at the address associated
with the FORK_S_SUSPEND instruction.


2. The Determine fork-ability step (i.e., Step-10) in the PEM is replaced by the following step:

o For an instruction decoded as a FORK_S_SUSPEND instruction, checking to determine the availability of
machine resources for forking an additional future thread.

3. The Fork step (i.e., Step-11) in the PEM is replaced by the following step:

o Forking a future thread, if there are resources available, by loading the address(es) associated with the
FORK_S_SUSPEND instruction into future program counter(s).

4. The Execute future thread step (i.e., Step-12) in the PEM has the following additional substep:

o The number sequence associated with the FORK_S_SUSPEND instruction controls the execution of the
corresponding future thread in the following manner. During the execution of the corresponding thread as a future
thread, the first s instructions may only be decoded, and the source and destination registers used in these s
instructions may be marked as holding valid operands, but these instructions need not be scheduled for
execution on any of the functional units. Furthermore, a number, say, n without any associated condition, implies
that the corresponding group of n instructions can be unconditionally executed as a future thread, and a
number, say, m with an associated condition, implies that the future thread execution of the corresponding
group of m instructions would be valid only if the compile-time specified condition evaluates to true at run time.

Alternative Embodiment 8: 

1. The Insert FORKs step (i.e., Step-3) in the PEM is replaced by the following step: 



o Insert zero or more FORK_M_SUSPEND instructions at zero or more of the potential fork points, where a
FORK_M_SUSPEND instruction contains an address identifying the starting address of an associated potential
future thread, and a set of masks, each with or without an associated condition.

2. The Determine fork-ability step (i.e., Step-10) in the PEM is replaced by the following step:

o For an instruction decoded as a FORK_M_SUSPEND instruction, checking to determine the availability of
machine resources for forking an additional future thread.

3. The Fork step (i.e., Step-11) in the PEM is replaced by the following step:

o Forking a future thread, if there are resources available, by loading the address(es) associated with the
FORK_M_SUSPEND instruction into future program counter(s).

4. The Execute future thread step (i.e., Step-12) in the PEM has the following additional substep:

o The mask sequence associated with the FORK_M_SUSPEND instruction controls the execution of the
corresponding future thread in the following manner. During the execution of the corresponding thread as a future
thread, a mask associated with the FORK_M_SUSPEND instruction, without any condition, represents the set
of architected registers which unconditionally hold valid source operands for the future thread execution, and
a mask associated with a condition refers to the set of architected registers which can be assumed to hold
valid source operands for the future thread execution only if the compile-time specified condition evaluates to
true at run time. The TM unit discards the results of some or all of the instructions in the future thread if the
compile-time specified conditions associated with the source register operands of the instructions do not hold
true at run time.
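
The mask semantics above can be sketched in a few lines. This is a simplified illustration, not the TM unit's actual logic: the names `valid_source_mask` and `must_discard` are hypothetical, bit r of a mask is assumed to stand for architected register r, and the sketch ignores registers that become valid because the future thread itself wrote them.

```python
from typing import Iterable, List, Optional, Tuple

# (mask, outcome): an outcome of None means the mask carries no condition;
# True/False is the run-time outcome of its compile-time specified condition.
MaskEntry = Tuple[int, Optional[bool]]


def valid_source_mask(entries: List[MaskEntry]) -> int:
    """Union of all masks that are unconditional or whose condition held."""
    valid = 0
    for mask, outcome in entries:
        if outcome is None or outcome:
            valid |= mask
    return valid


def must_discard(source_regs: Iterable[int], valid: int) -> bool:
    """True if any source register is not known to hold a valid operand."""
    return any(((valid >> reg) & 1) == 0 for reg in source_regs)


if __name__ == "__main__":
    # Registers 1 and 2 are unconditionally valid; register 3 depended on a
    # condition that failed at run time.
    valid = valid_source_mask([(0b0110, None), (0b1000, False)])
    print(must_discard([1, 2], valid), must_discard([3], valid))
```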

Alternative Embodiment 9: 

1. The Execute main thread step (i.e., Step-8) in the PEM has the following additional substep: 
Every branch resolution (i.e., the determination of whether a conditional branch is taken or not, and the associated
target address) during a thread execution is communicated to the TM unit. The TM unit uses this information to
determine if a future thread forked to the incorrect branch address, and any dependent threads, need to be
discarded. This enables simultaneous execution of control dependent blocks of instructions, as illustrated later.
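
A minimal sketch of this discard decision, under stated assumptions: the record type `FutureThread`, its field names, and the function `threads_to_discard` are hypothetical, and "dependent threads" is modelled here simply as threads forked by a thread that is itself discarded. The text does not specify how the TM unit tracks these relationships.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class FutureThread:
    name: str                        # e.g. the block at which the thread starts
    branch_block: str                # block whose final branch the fork speculated on
    assumed_taken: bool              # branch outcome under which the thread is correct
    forked_by: Optional[str] = None  # name of the forking future thread, if any


def threads_to_discard(threads: List[FutureThread],
                       branch_block: str, taken: bool) -> Set[str]:
    """Names of threads to discard once the branch ending branch_block resolves."""
    # Threads forked to the address of the branch outcome that did not occur.
    doomed = {t.name for t in threads
              if t.branch_block == branch_block and t.assumed_taken != taken}
    # Propagate to threads forked (directly or indirectly) by a doomed thread.
    changed = True
    while changed:
        changed = False
        for t in threads:
            if t.forked_by in doomed and t.name not in doomed:
                doomed.add(t.name)
                changed = True
    return doomed
```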

Alternative Embodiment 10:

1. The Fetch and merge-check step (i.e., Step-4) in the PEM has the following additional substep: 

The merge-check is extended to include a check to see if any of the previously forked threads has stayed
unmerged for longer than a pre-specified time-out period. Any such thread is discarded by the TM unit.
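
The extended merge-check can be illustrated with a small sketch. It assumes, purely for illustration, that the time-out is measured in machine cycles and that the TM unit records the cycle at which each future thread was forked; the function name `expired_threads` is hypothetical.

```python
from typing import Dict, List


def expired_threads(fork_cycle: Dict[str, int], now: int, timeout: int) -> List[str]:
    """Return the unmerged future threads that have outlived the time-out.

    fork_cycle maps each unmerged future thread to the cycle at which it was forked.
    """
    return sorted(name for name, forked_at in fork_cycle.items()
                  if now - forked_at > timeout)


if __name__ == "__main__":
    # With a 100-cycle time-out at cycle 120, only the thread forked at cycle 10 expires.
    print(expired_threads({"B9": 10, "B12": 90}, now=120, timeout=100))
```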

Detailed Description of Encodings of the New Instructions: 

Figures 4a through 4d illustrate the preferred encodings of some of the new instructions. Bit position 0 refers to the most 
significant bit position, and bit position 31 refers to the least significant bit position. 

1. FORK (Figure 4a) 

This instruction (Block 111) uses the primary op-code of 4, using bits 0 through 5. The relative address of the
starting address of the future thread is encoded in the 24-bit address field in bit positions 6 through 29. The last two bits,
bit positions 30 and 31, are used as an extended op-code field to provide encodings for alternate forms of the FORK
instruction. These two bits are set to 0 for this version of the FORK instruction.
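
As a hedged illustration of this layout, the following Python sketch packs and unpacks a FORK word using the bit numbering stated above (bit 0 is the most significant bit). The helper names `get_field` and `encode_fork` are hypothetical, and the relative address is treated as a raw unsigned 24-bit field, since its signedness and scaling are not stated here.

```python
FORK_PRIMARY_OPCODE = 4


def get_field(word: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit..last_bit of a 32-bit word, where bit 0 is the MSB."""
    width = last_bit - first_bit + 1
    return (word >> (31 - last_bit)) & ((1 << width) - 1)


def encode_fork(rel_addr: int, ext_opcode: int = 0) -> int:
    """Pack a FORK word: op-code in bits 0-5, address in bits 6-29, ext op-code in 30-31."""
    if not 0 <= rel_addr < (1 << 24):
        raise ValueError("relative address must fit in 24 bits")
    return (FORK_PRIMARY_OPCODE << 26) | (rel_addr << 2) | (ext_opcode & 0b11)


if __name__ == "__main__":
    word = encode_fork(0x123456)
    print(hex(word), get_field(word, 0, 5), hex(get_field(word, 6, 29)))
```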

2. UNCOND_SUSPEND (Figure 4b) 

This instruction (Block 222) uses the primary op-code of 19 in bit positions 0 through 5. Bits 21 through 30 of the
extended op-code field are set to 514 to distinguish it from other instructions with the same primary op-code. Bit 31
is set to 0 to distinguish this unconditional suspend instruction from the conditional suspend (SUSPEND)
instruction.

3. SUSPEND (Figure 4c) 

This instruction (Block 333) uses the primary op-code of 19 in bit positions 0 through 5. Bits 21 through 30 of the
extended op-code field are set to 514 to distinguish it from other instructions with the same primary op-code. Bit 31
is set to 1 to distinguish this conditional suspend instruction from the unconditional suspend
(UNCOND_SUSPEND) instruction. Compile-time branch speculations are one of the following: taken, not-taken, or
don't care. Therefore 2 bits are used for each of the seven compile-time branch speculations, C1 through C7, using
bit positions 7 through 20. The first condition in the sequence, C1 (bits 7 and 8), is associated with the first unique
conditional branch encountered by the forking thread at run time, after forking the future thread containing the
SUSPEND instruction, ... the seventh condition in the sequence, C7, is associated with the seventh unique conditional
branch encountered by the forking thread at run time, after forking the future thread containing the SUSPEND
instruction. The mode field is encoded in bit position 6. The semantics associated with this encoding have already
been explained above in the context of the SUSPEND instruction.
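
The SUSPEND field layout can be sketched as below. The field positions follow the description above. The 2-bit values chosen for taken (T), not-taken (N) and don't care (X) are NOT given in this section and are purely hypothetical, as are the function names and the choice to pad unused trailing conditions with don't care.

```python
SUSPEND_PRIMARY_OPCODE = 19
SUSPEND_EXTENDED_OPCODE = 514

# Hypothetical 2-bit codes; the text does not specify the actual values.
COND_CODE = {"X": 0, "T": 1, "N": 2}
COND_NAME = {code: name for name, code in COND_CODE.items()}


def encode_suspend(path: str, mode: int) -> int:
    """Pack a conditional SUSPEND word from a path expression such as 'TXT'."""
    if len(path) > 7:
        raise ValueError("at most seven conditions fit in bits 7 through 20")
    word = SUSPEND_PRIMARY_OPCODE << 26       # bits 0-5
    word |= (mode & 1) << 25                  # bit 6: mode field
    for i, symbol in enumerate(path.ljust(7, "X")):
        # C1 occupies bits 7-8 (shift 23), C7 occupies bits 19-20 (shift 11).
        word |= COND_CODE[symbol] << (23 - 2 * i)
    word |= SUSPEND_EXTENDED_OPCODE << 1      # bits 21-30
    word |= 1                                 # bit 31 = 1: conditional suspend
    return word


def decode_conditions(word: int) -> str:
    """Recover C1..C7 from a word produced by encode_suspend."""
    return "".join(COND_NAME[(word >> (23 - 2 * i)) & 0b11] for i in range(7))
```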

4. FORK_SUSPEND (Figure 4d)

This instruction (Block 444) also uses the same primary op-code of 4 as that used for the FORK instruction above,
in bit positions 0 through 5. However, the extended op-code field (bits 30 and 31) is set to 1 to distinguish it from
the FORK instruction. The relative address of the starting address of the future thread is encoded in the 10-bit
address field in bit positions 20 through 29. Compile-time branch speculations are one of the following: taken,
not-taken, or don't care. Therefore 2 bits are used for each of the four compile-time branch speculations, C1 through
C4. The first condition in the sequence, C1, is associated with the first unique conditional branch encountered by
the forking thread at run time, after forking the future thread containing the SUSPEND instruction, ... the fourth
condition in the sequence, C4, is associated with the fourth unique conditional branch encountered by the forking
thread at run time, after forking the future thread containing the SUSPEND instruction. The first number, N1 (bits 6
through 8), refers to the number of valid instructions starting at the starting address of the future thread, assuming
conditions associated with both C1 (bits 9 and 10) and C2 (bits 11 and 12) are evaluated to hold true at run time,
whereas N2 (bits 13 through 15) refers to the number of valid instructions starting at the starting address of the
future thread + N1 instructions, assuming conditions associated with both C3 (bits 16 and 17) and C4 (bits 18 and
19) are evaluated to hold true at run time.
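
To make the FORK_SUSPEND layout concrete, the sketch below splits a 32-bit word into the fields listed above. The helper names `put_field`, `get_field` and `decode_fork_suspend` and the returned dictionary layout are hypothetical; the condition fields are returned as raw 2-bit values because their encoding is not specified here.

```python
def put_field(value: int, first_bit: int, last_bit: int) -> int:
    """Place value into bits first_bit..last_bit of a 32-bit word (bit 0 is the MSB)."""
    width = last_bit - first_bit + 1
    return (value & ((1 << width) - 1)) << (31 - last_bit)


def get_field(word: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit..last_bit of a 32-bit word (bit 0 is the MSB)."""
    width = last_bit - first_bit + 1
    return (word >> (31 - last_bit)) & ((1 << width) - 1)


def decode_fork_suspend(word: int) -> dict:
    """Split a FORK_SUSPEND word into its two (count, cond, cond) groups and address."""
    if get_field(word, 0, 5) != 4 or get_field(word, 30, 31) != 1:
        raise ValueError("not a FORK_SUSPEND word")
    return {
        "groups": [
            # (N1, C1, C2)
            (get_field(word, 6, 8), get_field(word, 9, 10), get_field(word, 11, 12)),
            # (N2, C3, C4)
            (get_field(word, 13, 15), get_field(word, 16, 17), get_field(word, 18, 19)),
        ],
        "rel_addr": get_field(word, 20, 29),
    }
```

Note that the 3-bit count fields limit each group to at most seven instructions, and the 10-bit address field gives this form a much shorter reach than the 24-bit field of the plain FORK instruction.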

EXAMPLES 

Figures 5 and 6 illustrate the use of some of the instructions proposed in this invention, in samples of code
sequences. Code sequences shown have been broken into blocks of non-branch instructions, optionally ending with a
branch instruction. Instruction mnemonics used are either those introduced in this invention (e.g., FORK), or those of
the PowerPC architecture. (PowerPC is a trademark of International Business Machines Corp.) Any block of code
sequence that ends with a conditional branch has one edge labelled N to the block to which the control is transferred if
the branch is not taken, and another edge labelled T to the block to which the control is transferred if the branch is taken.
Figure 5 illustrates the use of the instructions proposed in this invention for speculating across control independent
blocks of instructions. FORK, SUSPEND, and UNCOND_SUSPEND instructions have been used to enable simultaneous
fetch, decode, speculation and execution of different control independent blocks, such as B1 and B12, in Figure 5.
When the control reaches from block B0 to B1, a FORK instruction is used in block B1 to start off execution of the
control-independent block B12 in parallel with B1. Note that the main thread executing B1 can follow one of several
paths but they all lead to block B12, executed as a future thread. Similarly, in case of a resolution of the branch at the
end of block B1 to B3, a FORK instruction is used for parallel execution of control-independent block B9. The thread
executing block B3 merges with the future thread started at B9, after executing either block B6 or B7.

Unconditional suspends, or UNCOND_SUSPEND instructions, are used in the future thread executions of blocks
B9 and B12 to observe essential dependencies resulting from updates to architected register 2 and memory location
mem6, respectively. A conditional suspend, or SUSPEND instruction, is used in block B9 to speculatively execute the next two
instructions, assuming the forking thread (executing block B3) flows into block B7 at run time and avoids block B6
(which updates register 3), as a result of the branch at the end of block B3. Similarly, assuming the control does not flow
into block B10 (which updates register 4), a SUSPEND instruction is used to speculatively execute the next four instructions.
Note that the path to be avoided, namely the path from the fork-point in block B1 to the merge-point in block B12, via
blocks B2 and B10, is coded at compile-time using the path expression TXT. This expression implies that the first
unique conditional branch after the fork point, i.e., the branch at the end of B1, is taken, the second branch, i.e., the
branch at the end of B2, can go either way, and the branch at the end of B8 is also taken. Note that there is more than
one good path (i.e., a path with no update to register 4) in this case. The branch at the end of block B2 can go either
to block B4 or block B5, and either of those paths would be considered good, if the branch at the end of B8 is not taken
and falls through to B11.
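
Matching a path expression such as TXT against the branch outcomes actually observed by the forking thread can be sketched as follows. The function name `path_matches` and the string representation of outcomes are hypothetical. In the Figure 5 example a match means that the path to be avoided (through B10) was followed; how a match is acted upon in general depends on the mode field of the SUSPEND instruction.

```python
def path_matches(expr: str, outcomes: str) -> bool:
    """Check observed branch outcomes against a compile-time path expression.

    expr     -- symbols from {T, N, X}, one per unique conditional branch
                encountered by the forking thread after the fork point
    outcomes -- symbols from {T, N}, the run-time results of those branches
    """
    if len(outcomes) < len(expr):
        return False  # not all of the named branches have been resolved yet
    # X (don't care) matches either outcome; T and N must match exactly.
    return all(e == "X" or e == o for e, o in zip(expr, outcomes))


if __name__ == "__main__":
    # B1 taken, B2 either way, B8 taken: the path through B10 was followed.
    print(path_matches("TXT", "TNT"), path_matches("TXT", "TTT"))
    # B8 not taken: control falls through to B11, so register 4 is not updated.
    print(path_matches("TXT", "TNN"))
```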

Note that the spill loads at the beginning of block B12 in Figure 5 have been preserved by the compiler to guarantee
the optional nature of the forks. Also note the use of the SKIP instruction in Figure 5 to optimize away the redundant
execution of spill loads, if B12 is executed as a future thread.

Figure 6 illustrates the use of FORK and SUSPEND instructions for speculating across control dependent blocks
of instructions. FORK instructions have been used to fork from block B100 to control dependent blocks B200 and B300.
The control dependent blocks B200 and B300 are executed speculatively, and in parallel. While the main thread executes
block B100, forked future threads execute block B200 and block B300. Upon the resolution of the branch at the end of
block B100, the TM unit discards the future thread conditioned on the incorrect branch outcome. For example, if the
branch is taken, the future thread starting at B200 is discarded. In the following it is explained in more detail how the
instructions proposed above help solve the problems identified before.

1. ALLEVIATING THE INSTRUCTION-FETCH BOTTLENECK 

As illustrated in the example above, the proposed fork and suspend instructions offer a novel way of addressing the
instruction fetch bottleneck of current superscalar processors. The compiler can use these instructions to point to
arbitrarily far (dynamically) control independent blocks. Control independence implies that given that the program
control has reached the fork point, it is bound to reach these future blocks (assuming, of course, no interrupt that can
alter the flow in an unforeseeable manner). Therefore, an instruction can be fetched as soon as the control dependence
of its block is resolved (without waiting for the control flow). Also, speculatively fetched instructions should only
be discarded if the branch from which they derive their control dependence (not the one from which control flow is
derived) is mispredicted. For example, instructions in block B9 can be fetched along with those of block B3, soon
after their shared control dependence on block B1 is either resolved or speculated. Furthermore, instructions from
block B9 should be considered a wasted fetch or discarded only if the control dependent branch at the end of block
B1 is mispredicted, and not if the branch at the end of block B3 is mispredicted. A traditional superscalar without any
notion of control dependence would discard its speculative fetches of blocks B7 (or B6) as well as B9 if blocks B7
and B9 are fetched via traditional control flow speculation of the branch at the end of block B3 and this turns out to
be a misprediction later on.

2. EXPLOITING DATA INDEPENDENCE ACROSS CONTROL INDEPENDENT BLOCKS

The instructions in the control independent blocks which are also data independent of all possible control flow paths
leading to these blocks can be executed simultaneously and non-speculatively, via multiple forks to these control
independent blocks. For example, the first three instructions in block B9 (which is control independent of B3) are
data independent of instructions in blocks B3, B6 and B7 (the set of basic blocks on the set of control flow paths from
B3 to B9). Hence they can be fetched and executed non-speculatively using the proposed fork and suspend
instructions.

3. SPECULATING DATA DEPENDENCE ACROSS CONTROL INDEPENDENT BLOCKS 
To increase the overlap between future thread and main thread activities, there has to be some form of speculation
on potential data dependence in the future thread. Consider the example in Figure 5. There is only one definition
of register 4 in blocks B1 through B11. It is defined in block B10. Speculating on the main thread control flow, i.e.,
assuming that the main thread control flow does not reach block B10, it is possible to increase the overlap between
the future thread starting at the beginning of block B12 and the main thread continuing through block B1. The exact
control flow leading to the offending instruction in block B10 is encoded as (TXT) as part of the proposed conditional
suspend instruction. Note that the control flow speculation is being done at compile time and hence based on static
branch prediction (and/or profile driven) techniques only. Also note that the net effect here is similar to speculatively
boosting the instructions between the conditional and the unconditional suspend instructions. But unlike previously
known techniques of guarded (or boosted) instructions, which encode the control flow condition as part of each
guarded (or boosted) instruction, the proposed technique encodes the condition for a group of instructions using
conditional and unconditional suspend instructions. Some of the important advantages of this approach are the
following:

o MINOR ARCHITECTURAL IMPACT 

As implied above, a primary advantage of the proposed scheme is its relatively minimal architectural impact.
Except for the addition of fork and suspend instructions (of which only the fork needs a primary op-code
space), the existing instruction encodings are unaffected. Therefore, unlike the boosting approach, the proposed
mechanism does not depend on available bits in the op-code of each boosted instruction to encode the
control flow speculation.

o PRECISE ENCODING OF THE SPECULATED CONTROL FLOW

Since the control flow speculation is encoded exclusively in a new (suspend) instruction, one can afford to
encode it precisely using more bits. For example, a compromise had to be reached in the boosting approach
to only encode the depth of the boosted instruction along the assumed flow path (each branch had an assumed
outcome bit, indicating the most likely trace path). This compromise was necessary to compactly encode the
speculated control flow, so that it could be accommodated in each boosted instruction's op-code. As a result
of this compromise, a speculatively executed boosted instruction was unnecessarily discarded on the misprediction
of a control independent branch. In the approach proposed here, control independent branches along
the speculated control flow path are properly encoded with an X, instead of N or T. Hence, a speculatively
executed instruction in the future thread is not discarded on the misprediction of a control independent branch.

o SMALL CODE EXPANSION

The typical percolation and boosting techniques often require code copying or patch-up code in the path off the
assumed trace. This can lead to significant expansion of code size. The proposed technique does not have any
of these overheads and the only code expansion is due to the fork and suspend instructions, which are shared
by a set of instructions.

o SIMPLER IMPLEMENTATION OF SEQUENTIAL EXCEPTION HANDLING 

There is no upward code motion in the proposed technique, and the code speculatively executed still resides 
in its original position only. Therefore, exception handling can be easily delayed until the main thread merges 
with the future thread containing the exception causing instruction. In other words, exceptions can be handled 
in proper order, without having to explicitly mark the original location of the speculative instructions which may 
raise exceptions. 

o SIMPLER IMPLEMENTATION OF PRECISE INTERRUPTS 

The unique main thread in this proposal is always precisely aware of the last instruction completed in the
sequential program order. Therefore, there is no need for any significant extra hardware for handling interrupts
precisely.

4. Decoupling of Compilation and Machine implementation 

Note that due to the optional nature of the forks, as explained before, the compilation for the proposed architecture
can be done assuming a machine capable of a large number of active threads. The actual machine implementation
has the option of obeying most of these forks, or some of these forks, or none of these forks, depending on
available machine resources. Thus compilation can be decoupled to a large extent in this context from the machine
implementation. This also implies that there may be no need to recompile separately for machines capable of a small
or large number of active threads.
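
The optional nature of the forks can be sketched as a resource check: a FORK is obeyed only when a future program counter is free, and is otherwise ignored without affecting correctness. The class `ForkUnit` and its method names are hypothetical, and a free program counter is used as the only resource for simplicity.

```python
from typing import List, Optional


class ForkUnit:
    """Toy model of a bank of future program counters."""

    def __init__(self, num_future_pcs: int) -> None:
        # None marks a future program counter that is currently free.
        self.future_pcs: List[Optional[int]] = [None] * num_future_pcs

    def try_fork(self, target: int) -> bool:
        """Obey a FORK if a future program counter is free; otherwise ignore it."""
        for i, pc in enumerate(self.future_pcs):
            if pc is None:
                self.future_pcs[i] = target  # load the fork address
                return True
        return False  # no resources: the main thread reaches the target sequentially


if __name__ == "__main__":
    # The same compiled code runs on a machine with one future thread or none.
    small, none = ForkUnit(1), ForkUnit(0)
    print(small.try_fork(0x100), small.try_fork(0x200), none.try_fork(0x100))
```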

5. Parallel Execution of Loop Iterations 

The proposed fork and suspend instructions can also be used to efficiently exploit across-iteration parallelism in nested
loops. For example, consider the sample loop illustrated before from one of the SPECint92 benchmarks. The
inner loop iterations of this loop are both control and data dependent on previous iterations. However, each
activation of the inner loop (i.e., the outer loop iterations) is independent of the previous one. Hence, it is possible for
the compiler to use the proposed fork instruction (starting at the outer loop body) to enable a machine to start many
activations of the inner loop without waiting for the previous ones to complete, and without unnecessarily discarding
executed instructions from the outer loop iterations on misprediction of some control and data-independent iteration
of the inner loop.

6. Easing of Register Pressure 

Instructions in the control independent basic blocks which are also data independent of each other can be not only
fetched but executed as well. The obvious question one might ask is why were these data and control independent
instructions not percolated up enough to be together in the same basic block? Although a good compiler would try
its best to achieve such percolations, it may not always be able to group these instructions together. As mentioned
before, to be able to efficiently group together all data and control independent instructions, the compiler needs to
have enough architected registers for proper encoding. For example, suppose some hypothetical machine in the
example used in Figure 5 only provides four architected registers, register 1 through register 4. The compiler for
such a machine cannot simply group the control and data independent instructions in basic blocks B1 and B12
without inserting additional spill code. The fork mechanism allows the compiler to convey the underlying data
independence without any additional spill code. In fact, some of the existing spill code may become redundant (e.g., the
first two loads in basic block B12) if B12 is actually forked at run-time. These spill loads can be optimized away
using the SKIP instruction, as explained before.

7. Speculating across control dependent blocks 

In the preceding discussion, forks have only been used for parallel execution of control independent blocks. One
can further extend the notion to include control dependent blocks. This further implies the ability to do both branch
paths speculatively. None of these speculations require further impact on the architecture, although there are
additional implementation costs involved. Additional usefulness of this form of speculation, which to some extent (along
one branch path) is already in use in current speculative superscalars, needs further examination. The example used in
Figure 6 illustrates the use of fork and suspend instructions for speculating across control dependent blocks, such
as blocks B200 and B300, which are both control dependent on B100. The forks in block B100 also let one speculate
along both branch paths and appropriately discard instructions based on the actual control flow (either to B200
or B300) at run time.

8. Simplified thread management 

o INTER-THREAD SYNCHRONIZATION:

The notion of a unique main thread and remaining future threads offers a simplified mechanism of inter-thread
synchronization, implying low overhead. At explicit suspension points, future threads simply suspend themselves
and wait for the main thread control to reach them. Alternatively, at different points during its execution,
a future thread can attempt explicit inter-thread synchronization with any other thread. But this more elaborate
inter-thread synchronization implies more hardware/software overhead.

o INTER-THREAD COMMUNICATION: 

The notions of forking with a copy of the architected machine state and the merge operation explained before
offer a mechanism of inter-thread communication with low overhead. Alternative mechanisms with much
higher overhead can offer explicit communication primitives which provide continuous communication protocol
between active threads, for example, via messages.

o THREAD SCHEDULING:

The mechanisms proposed in this invention which result in the optional nature of the FORKs (as explained
before) also simplify dynamic thread scheduling, as the run-time thread scheduling hardware is not required to
schedule (fork) a thread in response to a FORK instruction. Hence, the thread-scheduling hardware does not
need to be burdened with queueing and managing the future thread(s) implied by every FORK instruction. This
lowered hardware overhead of dynamic thread scheduling makes it more appealing with respect to static
thread scheduling, due to its other benefits, such as its adaptability to different machine implementations
without recompilation.

Claims 

1. A central processing apparatus in a computer comprising:

a. an instruction cache memory having a plurality of instructions, the instruction cache further having one or 
more instruction cache ports; 

b. a program counter bank of more than one program counter, each program counter capable of independently
addressing one or more instructions in the instruction cache, and porting the addressed instructions to one of
the instruction cache ports;

c. a dispatcher bank of more than one dispatcher, each dispatcher having an instruction buffer and each
dispatcher being capable of receiving instructions from one or more of the instruction cache ports, placing the
received instruction in its instruction buffer, decoding the instructions, and analyzing dependencies among the
instructions in its associated buffer;

d. a thread management unit that forks one or more threads and handles zero or more inter-thread
communications, each thread having a sequence of instructions executed using one of the program counters;

e. a scheduler that receives instructions from all the dispatchers and schedules the instructions for execution 
on one or more functional units; and 

f. a register file including a fixed set of one or more architected registers accessible by instructions in every
thread, whereby one or more instruction threads are executed by the functional units in parallel.

2. An apparatus, as in claim 1, where one of the program counters in the program counter bank tracks instructions in
a main thread, the main thread being the thread earliest in a sequential trace order.

3. An apparatus, as in claim 2, where one or more of the dispatchers speculates on one or more dependencies that
the dispatcher cannot resolve during the analysis, and the thread management unit is capable of determining
whether one or more instructions executed by any of the future threads need to be discarded due to violations of
program dependencies, as a consequence of one or more speculations, and the thread management unit discards
these violating instructions.

4. An apparatus, as in claim 3, where the thread management unit may fork a future thread starting at the specified
address, when a FORK instruction is encountered, the FORK instruction being inserted in the instruction thread at
compile time and the FORK instruction identifying the beginning of one or more of the future threads.

5. An apparatus, as in claim 4, where the FORK instruction includes an op-code field and one or more address fields, 
each address identifying the beginning location of a future thread. 

6. An apparatus, as in claim 5, where the FORK instruction op-code field includes bits 0 through 5, the address field
includes bits 6 through 29, and the extended op-code field includes bits 30 and 31.

7. An apparatus, as in claim 3, where the thread management unit may fork a future thread starting at the specified
address, when the FORK instruction is encountered, and unconditionally suspends the future thread when a
UNCOND_SUSPEND instruction is encountered, the FORK and UNCOND_SUSPEND instructions being inserted
at compile time.

8. An apparatus, as in claim 7, where the FORK instruction includes an op-code field and one or more address fields,
each address identifying the beginning location of a future thread and the UNCOND_SUSPEND instruction
includes an op-code field.

9. An apparatus, as in claim 8, where the FORK instruction op-code field includes bits 0 through 5, the address field
includes bits 6 through 29, the extended op-code field includes bits 30 and 31, and the UNCOND_SUSPEND
op-code has a primary op-code field including bits 0 through 5 and an extended op-code field including bits 21 through
31.

10. An apparatus, as in claim 7, having one or more SUSPEND instructions, the SUSPEND instruction being
encountered during the execution of one of the future threads and the thread management unit discarding the results of
the set of instructions in the dependence region associated with the SUSPEND instruction, if a compile-time
specified condition associated with the SUSPEND instruction evaluates to false at run time, the SUSPEND instructions
being inserted at compile time.

11. An apparatus, as in claim 10, where the SUSPEND instruction includes a SUSPEND op-code field, a mode-bit, and
a condition field.

12. An apparatus, as in claim 11, where the SUSPEND op-code has a primary op-code field including bits 0 through 5,
a mode field occupying bit 6, and a condition field occupying bits 7 through 20, consisting of seven condition
sub-fields, each 2 bits long, and an extended op-code field including bits 21 through 31.

13. An apparatus, as in claim 3, where the thread management unit may fork a future thread starting at the specified
address, when a FORK_SUSPEND instruction is encountered, the FORK_SUSPEND instruction being inserted in
the instruction thread at compile time and the FORK_SUSPEND instruction being capable of identifying one or
more sets of instructions, each set of instructions optionally having associated conditions determining the valid
execution of the respective set of instructions.

14. An apparatus, as in claim 13, where the FORK_SUSPEND instruction includes an op-code field, an address field,
and one or more condition fields, each condition field having a count field and one or more conditions.

15. An apparatus, as in claim 14, where the FORK_SUSPEND instruction has an op-code including bits 0 through 5,
a first condition field having a first count field including bits 6 through 8 and two conditions associated with the first
count field including bits 9-10 and 11-12 respectively, a second condition field having a second count field including
bits 13 through 15, two conditions associated with the second count field including bits 16-17 and 18-19
respectively, an address field including bits 20 through 29, and an extended op-code field including bits 30 and 31.

16. An apparatus, as in claim 10, where upon encountering a SKIP instruction, the future thread decodes a number of
instructions specified by the SKIP instruction and assumes the execution of the identified instructions without
performing the execution.

17. An apparatus, as in claim 16, where the SKIP instruction includes an op-code field, and a count field.

18. An apparatus, as in claim 3, where the thread management unit may fork a future thread starting at the specified
address, when a FORK_S_SUSPEND instruction is encountered, the FORK_S_SUSPEND instruction being
inserted in the instruction thread at compile time and the FORK_S_SUSPEND instruction being capable of
identifying one or more sets of instructions, each set of instructions optionally having associated conditions determining
the valid execution of the respective set of instructions and further having a skip count field identifying a number of
instructions at the start of the thread, and assumes the execution of the identified instructions without performing
the execution.

19. An apparatus, as in claim 18, where the FORK_S_SUSPEND instruction includes an op-code field, an address field, a
skip count field, and one or more condition fields, each condition field having a count field and one or more
conditions.

20. An apparatus, as in claim 3, where the thread management unit may fork a future thread starting at the specified
address, when a FORK_M_SUSPEND instruction is encountered, the FORK_M_SUSPEND instruction being
inserted in the instruction thread at compile time, and the FORK_M_SUSPEND instruction being capable of
identifying a set of register masks, each mask identifying a subset of architected registers which hold valid source
operands, provided the conditions, if any, associated with the mask hold at run time.

21. An apparatus, as in claim 20, where the FORK_M_SUSPEND instruction includes an op-code field, an address
field, and one or more condition fields, each condition field having a register mask, and one or more conditions.

22. An apparatus, as in claim 10, where upon encountering an FSKIP instruction, the future thread dispatcher skips the
fetch, and hence the execution of a specified number of instructions following this instruction, and the FSKIP
instruction being capable of identifying a register mask specifying the set of architected registers which hold valid
operands, the main thread dispatcher treating this as a NOP, and the FSKIP instruction being inserted in the
instruction thread at compile time.

23. An apparatus, as in claim 22, where the FSKIP instruction includes an op-code field, a mask field, and a count field.
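
The different treatment of FSKIP by the two kinds of dispatcher (claims 22 and 23) can be sketched in software. This is a hedged model, not the claimed hardware: the function name, the use of an instruction index in place of a program counter, and the bit-per-register reading of the mask are all assumptions.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class FSkip:
    opcode: str   # op-code field
    mask: int     # assumed encoding: bit i set => architected register i holds a valid operand
    count: int    # number of following instructions whose fetch is skipped


def dispatch_fskip(insn: FSkip, pc: int, is_future: bool, valid_regs: Set[int]) -> int:
    """Return the index of the next instruction to fetch after an FSKIP at index pc."""
    if not is_future:
        # Main thread dispatcher: FSKIP is a NOP, so fall through.
        return pc + 1
    # Future thread dispatcher: mark the masked registers as holding valid operands...
    for reg in range(insn.mask.bit_length()):
        if insn.mask & (1 << reg):
            valid_regs.add(reg)
    # ...and skip the fetch of the next `count` instructions.
    return pc + 1 + insn.count
```

For example, with `mask=0b1010` and `count=3`, a future thread at index 10 resumes fetching at index 14 with registers 1 and 3 marked valid, while the main thread simply moves on to index 11.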

24. An apparatus, as in claim 10, where upon encountering a SKPMG instruction, a future thread decodes a number
of instructions specified by the SKPMG instruction and assumes the execution of the identified instructions without
performing the execution, and the main thread dispatcher treats this instruction as a marker for the starting address
of a potential future thread, the SKPMG instruction being inserted in the instruction thread at compile time.

25. An apparatus, as in claim 24, where the SKPMG instruction includes an op-code field, and a count field.

26. An apparatus, as in claim 1, where the thread management unit can optionally fork.

27. An apparatus, as in claim 1, where the instruction cache is replaced by a main memory.

28. A method of executing instructions on a computer system with a central processing apparatus, comprising the 
steps of: 

a. Generating a static sequence of instructions at compile time, and analyzing the static sequence of instructions
to determine a set of fork points;

b. Inserting zero or more FORK instructions at zero or more of the fork points, at compile-time;

c. Loading the static sequence of instructions into a main memory starting at a fixed location in the memory
and transferring a subsequence of the static sequence to an instruction cache;

d. Fetching the instruction sequence from the instruction cache by addressing the sequence through a main
program counter starting at a current address, as a main thread, and checking to determine if there is one or
more unmerged future threads starting at the current address;

e. Checking the validity of the unmerged future threads;

f. Merging the validly executed portions of the zero or more unmerged future threads into the main thread;

g. Decoding the fetched instructions in a dispatcher, and checking to see if one or more of the instructions are
decoded as a FORK instruction;

h. For instructions decoded as other than FORK instructions, executing the main thread by analyzing instruc- 
tion dependencies, and by scheduling the instructions for execution on appropriate functional units; 

i. Completing instruction execution through a completion unit, and repeating steps (d) through this step;

j. For an instruction decoded as a FORK instruction, checking to determine the availability of machine 
resources for forking an additional future thread; 

k. Forking a future thread, if there are resources available, by loading the address associated with the FORK
instruction into a future program counter; and

l. Executing a future thread in parallel with the forking thread execution, by performing steps (d) through (h) by
using one of the future program counters and one of the future thread dispatchers, instead of the main program
counter and the main thread dispatcher, respectively, and suspending the future thread execution if the future
thread is merged with the main thread or the future thread is killed by the thread management unit.
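
The resource check, fork, and merge steps of claim 28 can be pictured with a small software model. This is only a sketch under stated assumptions: `ThreadManager`, `FutureThread`, the dictionary keyed by start address, and the `executed` counter are hypothetical names introduced for illustration, and the claimed apparatus is hardware rather than software.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FutureThread:
    start: int            # address loaded into a future program counter
    executed: int = 0     # instructions speculatively executed so far
    valid: bool = True    # cleared when a dependency violation is detected


@dataclass
class ThreadManager:
    max_future: int       # assumed number of future program counters available
    futures: Dict[int, FutureThread] = field(default_factory=dict)

    def try_fork(self, address: int) -> bool:
        """Steps (j) and (k): fork only if a future program counter is free."""
        if len(self.futures) >= self.max_future or address in self.futures:
            return False
        self.futures[address] = FutureThread(start=address)
        return True

    def merge_at(self, pc: int) -> int:
        """Steps (d) to (f): return how many instructions the main thread can take over."""
        thread = self.futures.pop(pc, None)
        if thread is None or not thread.valid:
            return 0      # nothing to merge, or the speculative results are discarded
        return thread.executed
```

In this model a second FORK is refused while the only future program counter is busy, and a future thread that has run four instructions contributes those four to the main thread when the main program counter reaches its start address.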

29. A method as in claim 28, where,

a. Step (b) has the following additional substep:

o STEP B.1:

Inserting an UNCOND_SUSPEND instruction at the end of every future thread;

b. Step (l) has the following additional substep:

o STEP L.1:

Suspending a future thread execution upon encountering the UNCOND_SUSPEND instruction; and

c. Step (h) has the following additional substep:

o STEP H.1: 

Treating the UNCOND_SUSPEND instruction as NOP, if encountered for execution by a thread other than 
its corresponding future thread; 

30. A method as in claim 29, where, 

a. Step (b) has the following additional substep: 

o STEP B.2:

Inserting zero or more SUSPEND instructions corresponding to every UNCOND_SUSPEND instruction;

b. Step (l) has the following additional substep:

o STEP L.2:

Discarding the set of instructions in the dependence region associated with the SUSPEND instruction, if
the compile-time specified condition associated with the SUSPEND instruction evaluates to false at run
time; and

c. Step (h) has the following additional substep:
o STEP H.2:

Treating the SUSPEND instruction as NOP, if encountered for execution by a thread other than its corre- 
sponding future thread; 

31. A method as in claim 30, where,

a. Step (b) has the following additional substep:

o STEP B.3:

Inserting zero or more SKIP instructions in a future thread;

b. Step (l) has the following additional substep:
o STEP L.3:

Decoding a specified number of instructions following the SKIP instruction, and assuming the execution of
these specified number of instructions without performing the execution, during execution as a future
thread; and

c. Step (h) has the following additional substep:
o STEP H.3:

Treating the SKIP instruction as NOP, if encountered for execution by a thread other than its corresponding 
future thread; 

32. A method as in claim 30, where,

a. Step (b) has the following additional substep:

o STEP B.4:

Inserting zero or more FSKIP instructions in a future thread;

b. Step (l) has the following additional substep:

o STEP L.4:

Skipping the fetch of a specified number of instructions following the FSKIP instruction during execution as
a future thread, and marking the registers identified in the associated mask as holding valid operands;

c. Step (h) has the following additional substep:

o STEP H.4:

Treating the FSKIP instruction as NOP, if encountered for execution by a thread other than its correspond- 
ing future thread; 

33. A method as in claim 30, where, 

a. Step (b) has the following additional substep: 

o STEP B.5:

Inserting a SKPMG instruction at the start of every future thread;

b. Step (l) has the following additional substep:

o STEP L.5:

Decoding a specified number of instructions following the SKIP instruction, and assuming the execution of
these specified number of instructions without performing the execution, during execution as a future
thread;

c. Step (h) has the following additional substep:
o STEP H.5:

Checking to determine if a future thread has been forked in the past starting at the instruction address of
the SKPMG instruction, if a SKPMG is encountered for execution by a thread other than its corresponding
future thread;

d. Step (d) is replaced by the following step: 
o STEP D.1: 

Fetching the instruction sequence from the instruction cache by addressing the sequence through a main 
program counter;

34. A method as in claim 28, where, 

a. Step (b) is replaced by the following step: 
o STEP B.6: 

Inserting zero or more FORK_SUSPEND instructions at zero or more of the potential fork points; 

b. Step (j) is replaced by the following step:
o STEP J.1:

For an instruction decoded as a FORK_SUSPEND instruction, checking to determine the availability of 
machine resources for forking an additional future thread; 

c. Step (k) is replaced by the following step: 
o STEP K.1: 

Forking a future thread, if there are resources available, by loading the address(es) associated with the 
FORK_SUSPEND instruction into future program counter(s);

d. Step (l) has the following additional substep:
o STEP L.6:

Discarding the results of some or all instructions in the future thread if the associated compile-time speci- 
fied conditions do not hold true at run time; 

35. A method as in claim 28, where, 

a. Step (b) is replaced by the following step: 
o STEP B.7: 

Inserting zero or more FORK_S_SUSPEND instructions at zero or more of the potential fork points; 

b. Step (j) is replaced by the following step:
o STEP J.2: 

For an instruction decoded as a FORK_S_SUSPEND instruction, checking to determine the availability of 
machine resources for forking an additional future thread; 

c. Step (k) is replaced by the following step: 
o STEP K.2: 

Forking a future thread, if there are resources available, by loading the address(es) associated with the 
FORK_S_SUSPEND instruction into future program counter(s);

d. Step (l) has the following additional substep:
o STEP L.7:

Decoding a specified number of instructions at the start of the future thread, and assuming the execution
of the specified number of instructions without performing the execution of these instructions, and discarding
the results of some or all of the instructions in the future thread if the associated compile-time specified
conditions do not hold true at run time;

36. A method as in claim 28, where,

a. Step (b) is replaced by the following step:

o STEP B.8:

Inserting zero or more FORK_M_SUSPEND instructions at zero or more of the potential fork points;

b. Step (j) is replaced by the following step:

o STEP J.3:

For an instruction decoded as a FORK_M_SUSPEND instruction, checking to determine the availability of 
machine resources for forking an additional future thread; 

c. Step (k) is replaced by the following step:

o STEP K.3:

Forking a future thread, if there are resources available, by loading the address(es) associated with the
FORK_M_SUSPEND instruction into future program counter(s);

d. Step (l) has the following additional substep:
o STEP L.8:

Discarding the results of some or all of the instructions in the future thread if the compile-time specified
conditions associated with the source register operands of the instructions do not hold true at run time;

37. A method as in claim 28, where, 

a. Step (h) has the following additional substep: 

o STEP H.6:

Communicating every branch resolution during a thread execution to the TM unit and the TM unit using 
this information to determine if a future thread forked to the incorrect branch address, and any dependent 
threads, need to be discarded; 
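
One way to picture the discard decision of step H.6 is as a walk over a fork tree. The sketch below assumes that "dependent threads" means threads forked, directly or indirectly, from the thread that was forked to the wrong branch address; that reading, the integer thread identifiers, and the `children` map are assumptions, not details given in the claim.

```python
from typing import Dict, List, Set


def threads_to_discard(mispredicted: int, children: Dict[int, List[int]]) -> Set[int]:
    """Return the mispredicted thread plus every thread forked from it, transitively."""
    doomed: Set[int] = set()
    stack = [mispredicted]
    while stack:
        tid = stack.pop()
        if tid in doomed:
            continue          # already scheduled for discard
        doomed.add(tid)
        stack.extend(children.get(tid, []))
    return doomed
```

With a hypothetical fork tree in which thread 1 forked threads 2 and 3, and thread 3 forked thread 4, a wrong branch address for thread 3 discards threads 3 and 4 and leaves threads 1 and 2 untouched.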

38. A method as in claim 28, where,

a. Step (d) has the following additional substep: 

o STEP D.2:

TM unit checking to determine if any of the previously forked threads has stayed unmerged for longer than
a pre-specified time-out period, and discarding any such thread;
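
The time-out check of step D.2 amounts to comparing each unmerged thread's age against a fixed limit. The following is a minimal sketch that assumes time is measured in cycles and that the TM unit records the cycle at which each thread was forked; neither detail is stated in the claim, and the function name is hypothetical.

```python
from typing import Dict, List


def discard_timed_out(fork_cycle: Dict[int, int], now: int, timeout: int) -> List[int]:
    """Remove and return the start addresses of threads unmerged for longer than `timeout`."""
    stale = sorted(addr for addr, forked in fork_cycle.items() if now - forked > timeout)
    for addr in stale:
        del fork_cycle[addr]   # the thread's speculative results are dropped
    return stale
```

A thread whose age exactly equals the time-out is kept in this sketch, because the claim recites "longer than" the pre-specified period.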